Predictive analytics: Complications - Minor
Thoughts
Questions
- What deficit are we talking about? Is it covered by the mRS?
- Should the pre-treatment mRS be in the model? Already covered by the others?
Backlog
- Honest estimation
- G estimation (using only significant ones to get accurate ORs)
- Make sure that the number of outcomes (26) is valid.
- Bootstrap the whole process - how many models include each of the features?
- The SuSiE method: https://www.biorxiv.org/content/10.1101/501114v1
- Choose best model with forward/backward exclusion
- Feature selection with honest estimation + p-value-based + bootstrap
- Feature selection with honest estimation + LASSO-based + bootstrap
- Implement PCA to create independent variables and not drop variables
- Bayesian fitting
- Change to use tidybayes and tidymodels
- Use neural networks to fit
- Consider a time-to-event analysis
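The bootstrap item in the backlog above can be sketched on simulated data: resample the dataset, refit the logistic model, and record how often each predictor passes a p &lt; 0.05 screen. This is a minimal base-R illustration with hypothetical variables x1-x3 (only x1 truly predictive), not the project's actual pipeline.

```r
# Bootstrap feature-inclusion frequencies on simulated data (illustration only)
set.seed(1891)
n <- 200
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
sim$y <- rbinom(n, 1, plogis(1.2 * sim$x1 - 0.2)) # only x1 carries signal

n_boot <- 100
selected <- matrix(FALSE, n_boot, 3, dimnames = list(NULL, c("x1", "x2", "x3")))
for (b in seq_len(n_boot)) {
  idx <- sample(n, replace = TRUE)
  fit <- glm(y ~ x1 + x2 + x3, data = sim[idx, ], family = binomial())
  selected[b, ] <- summary(fit)$coefficients[-1, "Pr(>|z|)"] < 0.05
}
# Proportion of bootstrap models in which each feature is "selected"
inclusion <- colMeans(selected)
```

A predictor that survives the screen in most resampled models is a stable candidate; one that appears rarely is likely noise.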
Setup
Imports
# Load packages
library(glmnet)
library(magrittr)
library(selectiveInference)
library(tidyverse)
# Data
# https://office365stanford-my.sharepoint.com/:x:/r/personal/simonlev_stanford_edu/_layouts/15/Doc.aspx?sourcedoc=%7BCE7DAF19-8F1A-43B0-866D-3D0F36336EDC%7D&file=Peds%20AVM%20Database%20July%2010%202024%20Onwards.xlsx&fromShare=true&action=default&mobileredirect=true
# Source code
Configurations
# Source outcome-specific configs
source(params$CONFIG)
# Destination locations
DST_DIRNAME <- "../../outputs/"
# Source locations
SRC_DIRNAME <- "../../data/3_tidy/patients"
SRC_BASENAME <- "patients-daily_2024-07-21.csv"
# Should the Dockerfile be automatically updated?
UPDATE_DOCKERFILE <- FALSE
# Set the seed
set.seed(1891)
Parameters
EXPOSURES_CONTINUOUS <- c(
"Age at 1st treatment (years)" = "age_at_first_treatment_yrs",
"mRS (presentation)" = "modified_rankin_score_presentation",
"mRS (pre-treatment)" = "modified_rankin_score_pretreatment",
"mRS (1-week post-op)" = "modified_rankin_score_postop_within_1_week",
"mRS (final)" = "modified_rankin_score_final",
"Nidus size (cm)" = "max_size_cm",
"Spetzler-Martin grade" = "spetzler_martin_grade",
"Lawton-Young grade" = "lawton_young_grade",
"Size score" = "size_score"
)
EXPOSURES_BINARY <- c(
"Sex (Male)" = "is_male",
"Involves eloquent location" = "is_eloquent_location",
"Has deep venous drainage" = "has_deep_venous_drainage",
"Diffuse nidus" = "is_diffuse_nidus",
"Hemorrhage at presentation" = "has_hemorrhage",
"Seizures at presentation" = "has_seizures",
"Deficit at presentation" = "has_deficit",
"Paresis at presentation" = "has_paresis",
"Presence of aneurysms" = "has_associated_aneurysm",
"Spetzler-Martin grade < 4" = "is_spetzler_martin_grade_less_than_4",
"Had surgery" = "is_surgery"
)
EXPOSURES_CATEGORICAL <- c(
"Location" = "location",
"Venous drainage" = "venous_drainage",
"Modality of treatment" = "procedure_combinations"
)
OUTCOMES <- c(
"Poor outcome (mRS >= 3)" = "is_poor_outcome",
"Obliteration" = "is_obliterated",
"Complications - minor" = "has_complication_minor",
"Complications - major" = "has_complication_major",
"mRS change (final - pre-treatment)" =
"modified_rankin_score_final_minus_pretx",
"mRS change (final - presentation)" =
"modified_rankin_score_final_minus_presentation",
"mRS change direction (Final - Pre-tx)" = "change_in_mrs_tx_vs_final"
)
Functions
R utils.
Data analysis utils.
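The tables below are printed with `sable()`, which is presumably defined in the sourced data-analysis utils. As a point of reference only, a minimal stand-in that renders a data frame as a markdown pipe table (an assumption about its behavior, not the project's actual helper) could look like:

```r
# Hypothetical stand-in for the sourced `sable()` helper: print a data
# frame as a markdown pipe table, matching the output style seen below.
sable <- function(df) {
  df[] <- lapply(df, as.character) # avoid numeric padding from as.matrix()
  header <- paste0("| ", paste(names(df), collapse = " | "), " |")
  divider <- paste0("|", paste(rep("---", ncol(df)), collapse = "|"), "|")
  rows <- apply(df, 1, function(r) paste0("| ", paste(r, collapse = " | "), " |"))
  cat(header, divider, rows, sep = "\n")
}

sable(data.frame(name = "lawton_young_grade", num_missing = 10))
```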
Read
# File path
filepath <- file.path(SRC_DIRNAME, SRC_BASENAME)
# Read
patients_ <-
read_csv(filepath) %>%
dplyr::sample_frac(params$PROPORTION_OF_DATA)
# If analyzing a subset of the data, restrict the dataset
if (params$TITLE == "Poor outcome - Hemorrhage") {
patients_ <- patients_ %>% filter(has_hemorrhage)
}
if (params$TITLE == "Poor outcome - No Hemorrhage") {
patients_ <- patients_ %>% filter(!has_hemorrhage)
}
Conform
Setup
Remove all empty rows.
Create two dataframes, one for univariable and one for multivariable analysis. The variables for multivariable analysis are recoded to reduce the number of factor levels and so avoid overfitting the model.
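The chunk performing these steps is not shown in this extract; below is a base-R sketch of the intended logic, with the `df_uni`/`df_multi` names taken from the later chunks (the exact construction is an assumption):

```r
# Drop rows in which every value is missing ("empty" rows, assumed definition)
drop_empty_rows <- function(df) {
  df[rowSums(!is.na(df)) > 0, , drop = FALSE]
}

# Toy check: the middle, all-NA row is removed
toy <- data.frame(a = c(1, NA, 3), b = c("x", NA, NA))
cleaned <- drop_empty_rows(toy)

# Both analysis dataframes then start from the same cleaned data;
# df_multi is recoded in the chunks below (assignments assumed):
# df_uni   <- drop_empty_rows(patients_)
# df_multi <- df_uni
```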
Recode columns
For venous_drainage, recode "Both" as "Deep".
df_multi <-
df_multi %>%
mutate(
venous_drainage = ifelse(venous_drainage == "Both", "Deep", venous_drainage)
)
For procedure_combinations, collapse combinations of more than one procedure into "Multimodal" to reduce the number of levels.
df_multi <-
df_multi %>%
mutate(procedure_combinations = case_when(
nchar(procedure_combinations) > 1 ~ "Multimodal",
.default = procedure_combinations
))
For procedure_combinations, flag whether the combination includes surgery.
df_multi <-
df_multi %>%
mutate(
is_surgery = str_detect(procedure_combinations, "S")
) %>%
relocate(is_surgery, .after = procedure_combinations)
df_uni <-
df_uni %>%
mutate(
is_surgery = str_detect(procedure_combinations, "S")
) %>%
relocate(is_surgery, .after = procedure_combinations)
For location, reduce choices (use "Other").
df_multi <-
df_multi %>%
mutate(
location = ifelse(location == "Corpus Callosum", "Other", location),
location = ifelse(location == "Cerebellum", "Other", location),
# location = ifelse(location == "Deep", "Other", location),
location = factor(location),
location = relevel(location, ref = "Other")
)
For spetzler_martin_grade, create a binary variant (grades 1-3 vs. 4-5).
# For multivariable analysis
df_multi <-
df_multi %>%
mutate(
is_spetzler_martin_grade_less_than_4 = spetzler_martin_grade < 4
) %>%
relocate(is_spetzler_martin_grade_less_than_4, .after = spetzler_martin_grade)
# For univariable analysis
df_uni <-
df_uni %>%
mutate(
is_spetzler_martin_grade_less_than_4 = spetzler_martin_grade < 4
) %>%
relocate(is_spetzler_martin_grade_less_than_4, .after = spetzler_martin_grade)
Missingness
# Get cols
cols <- unname(c(
EXPOSURES_CONTINUOUS,
EXPOSURES_BINARY,
EXPOSURES_CATEGORICAL,
OUTCOMES
))
# Count missingness
df_multi %>%
select(all_of(cols)) %>%
summarise_all(~ sum(is.na(.))) %>%
pivot_longer(everything(), values_to = "num_missing") %>%
arrange(desc(num_missing)) %>%
filter(num_missing > 0) %>%
sable()
| name | num_missing |
|---|---|
| lawton_young_grade | 10 |
| spetzler_martin_grade | 8 |
| is_spetzler_martin_grade_less_than_4 | 8 |
| size_score | 6 |
| has_associated_aneurysm | 6 |
| max_size_cm | 5 |
| is_eloquent_location | 5 |
| has_deep_venous_drainage | 4 |
| modified_rankin_score_pretreatment | 3 |
| modified_rankin_score_postop_within_1_week | 3 |
| is_surgery | 3 |
| venous_drainage | 3 |
| procedure_combinations | 3 |
| modified_rankin_score_final_minus_pretx | 3 |
| has_complication_minor | 2 |
| has_complication_major | 2 |
| modified_rankin_score_final | 1 |
| is_male | 1 |
| is_diffuse_nidus | 1 |
| location | 1 |
| is_poor_outcome | 1 |
| modified_rankin_score_final_minus_presentation | 1 |
| change_in_mrs_tx_vs_final | 1 |
Which eligible patients are missing each variable?
df_multi %>%
filter(is_eligible) %>%
select(mrn, all_of(cols)) %>%
mutate(across(-mrn, is.na)) %>%
pivot_longer(
cols = -mrn,
names_to = "name",
values_to = "is_missing"
) %>%
filter(is_missing) %>%
group_by(name) %>%
summarize(mrn = str_c(mrn, collapse = ", ")) %>%
sable()
| name | mrn |
|---|---|
| change_in_mrs_tx_vs_final | 60860434 |
| has_associated_aneurysm | 60233426, 60310463, 60303054, 61011961, 62696083, 62609300 |
| has_complication_major | 60860434, 62023254 |
| has_complication_minor | 60860434, 62023254 |
| has_deep_venous_drainage | 60233426, 62609300, 42446310, 81981102 |
| is_diffuse_nidus | 15871379 |
| is_eloquent_location | 62728597, 62696083, 62609300, 42446310, 81981102 |
| is_male | 19399469 |
| is_poor_outcome | 60860434 |
| is_spetzler_martin_grade_less_than_4 | 60233426, 60303054, 60860434, 62728597, 62696083, 62609300, 42446310, 81981102 |
| is_surgery | 49659493, 46166047, 62609300 |
| lawton_young_grade | 60233426, 60310463, 60303054, 60860434, 61011961, 62728597, 62696083, 62609300, 42446310, 81981102 |
| location | 60310463 |
| max_size_cm | 60233426, 60303054, 60860434, 62609300, 81981102 |
| modified_rankin_score_final | 60860434 |
| modified_rankin_score_final_minus_presentation | 60860434 |
| modified_rankin_score_final_minus_pretx | 60860434, 49659493, 46166047 |
| modified_rankin_score_postop_within_1_week | 60860434, 49659493, 46166047 |
| modified_rankin_score_pretreatment | 60860434, 49659493, 46166047 |
| procedure_combinations | 49659493, 46166047, 62609300 |
| size_score | 60233426, 60303054, 60860434, 62609300, 42446310, 81981102 |
| spetzler_martin_grade | 60233426, 60303054, 60860434, 62728597, 62696083, 62609300, 42446310, 81981102 |
| venous_drainage | 60233426, 62609300, 81981102 |
Visualizations
Overall
Distribution of values of the exposures across levels of the outcome.
# Define the predictors of interest
cols <- c(
EXPOSURES_CONTINUOUS,
EXPOSURES_BINARY,
CHOSEN_OUTCOME
)
# Reshape the data
df_long <-
df_uni %>%
select(all_of(cols)) %>%
pivot_longer(-all_of(CHOSEN_OUTCOME), names_to = "predictor", values_to = "value")
# Plot the box plots
df_long %>%
ggplot(aes(
x = .data[[CHOSEN_OUTCOME]],
y = value,
fill = .data[[CHOSEN_OUTCOME]],
color = .data[[CHOSEN_OUTCOME]]
)) +
geom_boxplot(alpha = 0.5) +
facet_wrap(~predictor, scales = "free") +
labs(x = "Outcome", y = "Value", fill = "Outcome") +
theme_minimal() +
theme(
axis.text = element_text(size = 10),
axis.title = element_text(size = 11),
strip.text = element_text(size = 11),
legend.position = "none"
)
By hemorrhage
# Define the predictors of interest
cols <- c(
EXPOSURES_CONTINUOUS,
EXPOSURES_BINARY,
CHOSEN_OUTCOME
)
# Reshape the data
df_long <-
df_uni %>%
select(all_of(cols)) %>%
pivot_longer(
cols = -c(CHOSEN_OUTCOME, `Hemorrhage at presentation`),
names_to = "predictor",
values_to = "value"
)
# Plot the box plots
df_long %>%
ggplot(aes(
x = .data[[CHOSEN_OUTCOME]],
y = value,
fill = `Hemorrhage at presentation`,
color = `Hemorrhage at presentation`
)) +
geom_boxplot(alpha = 0.5) +
facet_wrap(~predictor, scales = "free") +
labs(x = "Outcome", y = "Value", color = "Hemorrhage at presentation") +
theme_minimal() +
guides(fill = "none") +
theme(
axis.text = element_text(size = 10),
axis.title = element_text(size = 11),
strip.text = element_text(size = 11)
)
Descriptive statistics
Setup
Descriptive statistics - continuous variables.
compute_descriptive_continuous <- function(
df = df_uni,
cols = c(EXPOSURES_CONTINUOUS),
use_interaction = TRUE,
interaction_col = "has_hemorrhage",
interaction_col_name = "Hemorrhage at presentation",
interaction_col_value = TRUE,
prefix = "Haem") {
# Apply desired formatting to numbers
format_numbers <- function(x) {
sprintf("%.1f", x)
}
# Compute new dataset
df <-
df %>%
select(all_of(c(cols, CHOSEN_OUTCOME, interaction_col))) %>%
drop_na(all_of(c(CHOSEN_OUTCOME, interaction_col)))
# Apply interaction
if (use_interaction) {
df <-
df %>%
filter(!!sym(interaction_col) == interaction_col_value)
}
# Compute descriptive statistics
desc_stats <-
df %>%
pivot_longer(where(is.numeric), names_to = "keys", values_to = "values") %>%
group_by(!!sym(CHOSEN_OUTCOME), keys) %>%
summarize(
num_total = n(),
num_missing = sum(is.na(values)),
mean = mean(values, na.rm = TRUE),
sd = sd(values, na.rm = TRUE),
min = quantile(values, 0, na.rm = TRUE),
max = quantile(values, 1, na.rm = TRUE),
q_50 = quantile(values, 0.50, na.rm = TRUE),
q_25 = quantile(values, 0.25, na.rm = TRUE),
q_75 = quantile(values, 0.75, na.rm = TRUE)
) %>%
ungroup() %>%
mutate(
stats = glue::glue("{format_numbers(mean)} ({format_numbers(sd)})"),
!!sym(CHOSEN_OUTCOME) := factor(
!!sym(CHOSEN_OUTCOME),
levels = CHOSEN_OUTCOME_LEVELS,
labels = CHOSEN_OUTCOME_LABELS
)
) %>%
pivot_wider(
id_cols = keys,
names_from = !!sym(CHOSEN_OUTCOME),
values_from = stats
) %>%
relocate(CHOSEN_OUTCOME_LABELS[1], .after = CHOSEN_OUTCOME_LABELS[2])
# Compute p-values
pvals <-
df %>%
pivot_longer(where(is.numeric), names_to = "keys", values_to = "values") %>%
group_by(keys) %>%
summarize(
pvalue = wilcox.test(values ~ !!sym(CHOSEN_OUTCOME))$p.value
)
# Synthesize
out <-
desc_stats %>%
left_join(pvals, by = "keys")
# Add sample size to column names
num_with_outcome <- sum(df %>% pull(CHOSEN_OUTCOME))
num_no_outcome <- sum(!df %>% pull(CHOSEN_OUTCOME))
new_col_names <- c(
glue::glue("{prefix} - {CHOSEN_OUTCOME_LABELS[2]} (n={num_with_outcome})"),
glue::glue("{prefix} - {CHOSEN_OUTCOME_LABELS[1]} (n={num_no_outcome})"),
glue::glue("{prefix} - P-value")
)
out <-
out %>%
rename_with(
# Column order and names match the binary helper: with-outcome column first
.cols = !!sym(CHOSEN_OUTCOME_LABELS[2]):pvalue, ~new_col_names
) %>%
mutate(
keys = ifelse(keys == interaction_col, interaction_col_name, keys)
)
return(out)
}
Descriptive statistics - binary variables.
compute_descriptive_binary <- function(
df = df_uni,
cols = c(EXPOSURES_BINARY),
use_interaction = TRUE,
interaction_col = "has_hemorrhage",
interaction_col_name = "Hemorrhage at presentation",
interaction_col_value = TRUE,
prefix = "Haem") {
# Define arguments
cols <- cols[cols != interaction_col]
# Apply desired formatting to numbers
format_numbers <- function(x, decimals = 0) {
decimals <- paste0("%.", decimals, "f")
sprintf(decimals, x)
}
# Compute new dataset
df <-
df %>%
select(all_of(cols), all_of(CHOSEN_OUTCOME), all_of(interaction_col)) %>%
drop_na(all_of(c(CHOSEN_OUTCOME, interaction_col)))
# Apply interaction
if (use_interaction) {
df <-
df %>%
filter(!!sym(interaction_col) == interaction_col_value)
}
# Compute descriptive statistics
desc_stats <-
df %>%
pivot_longer(
cols = -c(CHOSEN_OUTCOME),
names_to = "keys",
values_to = "values"
) %>%
group_by(!!sym(CHOSEN_OUTCOME), keys) %>%
summarize(
num_total = n(),
num_missing = sum(is.na(values)),
num_with = sum(values, na.rm = TRUE),
num_without = sum(!values, na.rm = TRUE),
pct_with = mean(values, na.rm = TRUE)
) %>%
ungroup() %>%
mutate(
stats = glue::glue("{num_with} ({format_numbers(pct_with*100)}%)"),
!!sym(CHOSEN_OUTCOME) := factor(
!!sym(CHOSEN_OUTCOME),
levels = CHOSEN_OUTCOME_LEVELS,
labels = CHOSEN_OUTCOME_LABELS
)
) %>%
pivot_wider(
id_cols = keys,
names_from = !!sym(CHOSEN_OUTCOME),
values_from = stats
) %>%
relocate(CHOSEN_OUTCOME_LABELS[1], .after = CHOSEN_OUTCOME_LABELS[2])
# Compute p-values
pvals <-
df %>%
pivot_longer(
cols = -c(CHOSEN_OUTCOME),
names_to = "keys",
values_to = "values"
) %>%
group_by(keys) %>%
summarize(
pvalue = kruskal.test(values ~ !!sym(CHOSEN_OUTCOME))$p.value
)
# Synthesize
out <-
desc_stats %>%
left_join(pvals, by = "keys")
# Prettify
num_with_outcome <- sum(df %>% pull(CHOSEN_OUTCOME))
num_no_outcome <- sum(!df %>% pull(CHOSEN_OUTCOME))
new_col_names <- c(
glue::glue("{prefix} - {CHOSEN_OUTCOME_LABELS[2]} (n={num_with_outcome})"),
glue::glue("{prefix} - {CHOSEN_OUTCOME_LABELS[1]} (n={num_no_outcome})"),
glue::glue("{prefix} - P-value")
)
out <-
out %>%
rename_with(
.cols = !!sym(CHOSEN_OUTCOME_LABELS[2]):pvalue, ~new_col_names
) %>%
mutate(
keys = ifelse(keys == interaction_col, interaction_col_name, keys)
)
return(out)
}
Non-specific
# Rename dataframe
`Pediatric AVMs` <-
df_uni %>%
filter(is_eligible) %>%
select(-is_eligible, -comments)
# Create summary
`Pediatric AVMs` %>%
summarytools::dfSummary(display.labels = FALSE) %>%
print(
file =
"../../outputs/descriptive-statistics/descriptive_statistics_eligible.html",
footnote = NA
)
# Remove unwanted dataframe
remove(`Pediatric AVMs`)
By hemorrhage - table1 (WIP)
https://cran.r-project.org/web/packages/table1/vignettes/table1-examples.html
library(table1)
table1::table1(
~ factor(is_male) + age_at_first_treatment_yrs + factor(is_eloquent_location) + max_size_cm | has_hemorrhage,
data = df_uni
)
# Initialize values
df <- df_uni
# Remove missing values in stratification variables
df <-
df %>%
drop_na(has_hemorrhage, CHOSEN_OUTCOME)
# Define columns
cols <- c(
EXPOSURES_CONTINUOUS,
EXPOSURES_BINARY,
OUTCOMES[OUTCOMES == CHOSEN_OUTCOME]
)
# Assign labels to the variables in the dataframe
for (var in cols) {
label(df[[var]]) <- names(cols[cols == var])
}
# Assign meaningful labels to the has_hemorrhage variable
df$has_hemorrhage <- factor(
df$has_hemorrhage,
levels = c(FALSE, TRUE),
labels = c("No Hemorrhage", "Has Hemorrhage")
)
df[, CHOSEN_OUTCOME] <- factor(
df %>% pull(CHOSEN_OUTCOME),
levels = CHOSEN_OUTCOME_LEVELS,
labels = CHOSEN_OUTCOME_LABELS
)
# Define custom rendering functions if needed
render_continuous <- function(x) {
with(stats.apply.rounding(stats.default(x), digits = 2), c(
"Mean (SD)" = sprintf("%s (± %s)", MEAN, SD),
"Median (IQR)" = sprintf("%s (%s - %s)", MEDIAN, Q1, Q3)
))
}
compute_pvalues <- function(x, ...) {
# Construct vectors of data y, and groups (strata) g
y <- unlist(x)
g <- factor(rep(1:length(x), times = sapply(x, length)))
if (is.numeric(y)) {
# For numeric variables, perform a standard 2-sample t-test
p <- t.test(y ~ g)$p.value
} else {
# For categorical variables, perform a chi-squared test of independence
p <- chisq.test(table(y, g))$p.value
}
# Format the p-value, using an HTML entity for the less-than sign.
# The initial empty string places the output on the line below the variable label.
c("", sub("<", "&lt;", format.pval(p, digits = 3, eps = 0.001)))
}
# Build the stratified table formula
frla <- as.formula(paste("~ . | has_hemorrhage *", CHOSEN_OUTCOME))
# Create table
table1_descriptive_statistics <-
table1(frla,
data = df[, unname(cols)],
render.continuous = render_continuous,
overall = c(left = "Overall"),
topclass = "Rtable1-zebra"
)
# Print
table1_descriptive_statistics
By hemorrhage
# Define arguments
interaction_col <- "has_hemorrhage"
interaction_col_name <- "Hemorrhage at presentation"
prefix_true <- "Haem"
prefix_false <- "No Haem"
# Get all
out <- list()
out[[1]] <- compute_descriptive_continuous(
use_interaction = F,
interaction_col = interaction_col,
interaction_col_name = interaction_col_name,
interaction_col_value = T,
prefix = "All"
)
out[[2]] <- compute_descriptive_binary(
use_interaction = F,
interaction_col = interaction_col,
interaction_col_name = interaction_col_name,
interaction_col_value = T,
prefix = "All"
)
out_all <- bind_rows(out)
# Get with hemorrhage
out <- list()
out[[1]] <- compute_descriptive_continuous(
use_interaction = T,
interaction_col = interaction_col,
interaction_col_name = interaction_col_name,
interaction_col_value = T,
prefix = prefix_true
)
out[[2]] <- compute_descriptive_binary(
use_interaction = T,
interaction_col = interaction_col,
interaction_col_name = interaction_col_name,
interaction_col_value = T,
prefix = prefix_true
)
out_haem <- bind_rows(out)
# Get without hemorrhage
out <- list()
out[[1]] <- compute_descriptive_continuous(
use_interaction = T,
interaction_col = interaction_col,
interaction_col_name = interaction_col_name,
interaction_col_value = F,
prefix = prefix_false
)
out[[2]] <- compute_descriptive_binary(
use_interaction = T,
interaction_col = interaction_col,
interaction_col_name = interaction_col_name,
interaction_col_value = F,
prefix = prefix_false
)
out_no_haem <- bind_rows(out)
# Synthesize
out_complete <-
out_all %>%
left_join(out_haem, by = "keys") %>%
left_join(out_no_haem, by = "keys") %>%
arrange(`All - P-value`)
# Print
out_complete %>%
sable()
| keys | All - Minor complications (n=51) | All - No minor complications (n=183) | All - P-value | Haem - Minor complications (n=28) | Haem - No minor complications (n=114) | Haem - P-value | No Haem - Minor complications (n=23) | No Haem - No minor complications (n=69) | No Haem - P-value |
|---|---|---|---|---|---|---|---|---|---|
| Lawton-Young grade | 4.2 (1.6) | 5.0 (1.5) | 0.002 | 3.8 (1.4) | 4.3 (1.4) | 0.075 | 4.9 (1.5) | 5.9 (1.3) | 0.004 |
| Spetzler-Martin grade | 2.7 (1.2) | 3.3 (1.2) | 0.002 | 2.6 (1.2) | 2.9 (1.1) | 0.123 | 2.9 (1.2) | 3.8 (1.2) | 0.005 |
| Nidus size (cm) | 3.3 (2.3) | 4.0 (2.1) | 0.004 | 2.9 (2.2) | 3.1 (1.6) | 0.201 | 3.8 (2.5) | 5.2 (2.1) | 0.005 |
| Spetzler-Martin grade < 4 | 26 (53%) | 132 (74%) | 0.005 | 19 (70%) | 86 (79%) | 0.346 | 7 (32%) | 46 (67%) | 0.004 |
| Size score | 1.6 (0.7) | 1.9 (0.8) | 0.010 | 1.5 (0.7) | 1.6 (0.6) | 0.320 | 1.8 (0.7) | 2.3 (0.7) | 0.008 |
| Age at 1st treatment (years) | 11.0 (3.9) | 12.2 (3.7) | 0.055 | 10.8 (3.5) | 11.2 (3.7) | 0.428 | 11.3 (4.4) | 13.3 (3.4) | 0.056 |
| Has deep venous drainage | 33 (66%) | 94 (52%) | 0.084 | 21 (75%) | 63 (57%) | 0.079 | 12 (55%) | 31 (45%) | 0.434 |
| Involves eloquent location | 36 (71%) | 104 (58%) | 0.117 | 17 (61%) | 59 (54%) | 0.533 | 19 (83%) | 45 (65%) | 0.118 |
| Diffuse nidus | 10 (20%) | 21 (12%) | 0.135 | 6 (21%) | 10 (9%) | 0.059 | 4 (17%) | 11 (16%) | 0.893 |
| mRS (1-week post-op) | 1.6 (1.3) | 1.9 (1.4) | 0.146 | 1.9 (1.4) | 2.0 (1.3) | 0.765 | 1.1 (1.0) | 1.8 (1.4) | 0.036 |
| Sex (Male) | 31 (61%) | 93 (51%) | 0.222 | 16 (57%) | 60 (53%) | 0.669 | 15 (65%) | 33 (49%) | 0.168 |
| Seizures at presentation | 16 (31%) | 43 (23%) | 0.253 | 7 (25%) | 25 (22%) | 0.728 | 9 (39%) | 18 (26%) | 0.237 |
| Deficit at presentation | 22 (43%) | 64 (35%) | 0.286 | 11 (39%) | 41 (36%) | 0.745 | 11 (48%) | 23 (33%) | 0.215 |
| Hemorrhage at presentation | 28 (55%) | 114 (62%) | 0.340 | 28 (100%) | 114 (100%) | 0 (0%) | 0 (0%) | ||
| mRS (presentation) | 2.4 (1.8) | 2.0 (1.6) | 0.353 | 3.2 (1.7) | 2.6 (1.8) | 0.108 | 1.0 (1.0) | 1.3 (1.0) | 0.073 |
| mRS (final) | 1.0 (1.3) | 1.0 (1.2) | 0.450 | 1.0 (1.3) | 1.2 (0.9) | 0.099 | 0.9 (1.2) | 0.9 (1.4) | 0.569 |
| Had surgery | 32 (63%) | 105 (58%) | 0.572 | 18 (64%) | 75 (68%) | 0.742 | 14 (61%) | 30 (43%) | 0.150 |
| mRS (pre-treatment) | 1.6 (1.5) | 1.6 (1.3) | 0.665 | 2.1 (1.6) | 1.8 (1.4) | 0.515 | 0.8 (0.8) | 1.3 (1.1) | 0.049 |
| Paresis at presentation | 14 (27%) | 45 (25%) | 0.678 | 10 (36%) | 31 (27%) | 0.374 | 4 (17%) | 14 (20%) | 0.763 |
| Presence of aneurysms | 10 (21%) | 36 (20%) | 0.899 | 8 (30%) | 26 (23%) | 0.488 | 2 (10%) | 10 (15%) | 0.546 |
By diffuse nidus
# Define arguments
interaction_col <- "is_diffuse_nidus"
interaction_col_name <- "Is diffuse nidus"
prefix_true <- "Diffuse"
prefix_false <- "Not Diffuse"
# Get all
out <- list()
out[[1]] <- compute_descriptive_continuous(
use_interaction = F,
interaction_col = interaction_col,
interaction_col_name = interaction_col_name,
interaction_col_value = T,
prefix = "All"
)
out[[2]] <- compute_descriptive_binary(
use_interaction = F,
interaction_col = interaction_col,
interaction_col_name = interaction_col_name,
interaction_col_value = T,
prefix = "All"
)
out_all <- bind_rows(out)
# Get diffuse nidus
out <- list()
out[[1]] <- compute_descriptive_continuous(
use_interaction = T,
interaction_col = interaction_col,
interaction_col_name = interaction_col_name,
interaction_col_value = T,
prefix = prefix_true
)
out[[2]] <- compute_descriptive_binary(
use_interaction = T,
interaction_col = interaction_col,
interaction_col_name = interaction_col_name,
interaction_col_value = T,
prefix = prefix_true
)
out_haem <- bind_rows(out)
# Get non-diffuse nidus
out <- list()
out[[1]] <- compute_descriptive_continuous(
use_interaction = T,
interaction_col = interaction_col,
interaction_col_name = interaction_col_name,
interaction_col_value = F,
prefix = prefix_false
)
out[[2]] <- compute_descriptive_binary(
use_interaction = T,
interaction_col = interaction_col,
interaction_col_name = interaction_col_name,
interaction_col_value = F,
prefix = prefix_false
)
out_no_haem <- bind_rows(out)
# Synthesize
out_complete <-
out_all %>%
left_join(out_haem, by = "keys") %>%
left_join(out_no_haem, by = "keys") %>%
arrange(`All - P-value`)
# Print
out_complete %>%
sable()
| keys | All - Minor complications (n=51) | All - No minor complications (n=182) | All - P-value | Diffuse - Minor complications (n=10) | Diffuse - No minor complications (n=21) | Diffuse - P-value | Not Diffuse - Minor complications (n=41) | Not Diffuse - No minor complications (n=161) | Not Diffuse - P-value |
|---|---|---|---|---|---|---|---|---|---|
| Lawton-Young grade | 4.2 (1.6) | 5.0 (1.5) | 0.002 | 6.8 (1.2) | 6.5 (1.0) | 0.280 | 3.9 (1.2) | 4.6 (1.4) | 0.005 |
| Spetzler-Martin grade | 2.7 (1.2) | 3.3 (1.2) | 0.002 | 4.5 (0.8) | 4.0 (0.8) | 0.090 | 2.5 (1.1) | 3.1 (1.2) | 0.002 |
| Nidus size (cm) | 3.3 (2.3) | 4.0 (2.1) | 0.005 | 6.7 (3.3) | 5.6 (2.4) | 0.420 | 2.8 (1.7) | 3.6 (1.8) | 0.007 |
| Spetzler-Martin grade < 4 | 26 (53%) | 131 (74%) | 0.005 | 3 (30%) | 2 (10%) | 0.154 | 23 (59%) | 129 (83%) | 0.001 |
| Size score | 1.6 (0.7) | 1.9 (0.8) | 0.011 | 2.6 (0.7) | 2.3 (0.7) | 0.229 | 1.5 (0.6) | 1.8 (0.8) | 0.011 |
| Age at 1st treatment (years) | 11.0 (3.9) | 12.2 (3.7) | 0.061 | 10.9 (3.9) | 9.8 (3.5) | 0.565 | 11.1 (3.9) | 12.8 (3.5) | 0.008 |
| Has deep venous drainage | 33 (66%) | 94 (53%) | 0.091 | 9 (90%) | 18 (86%) | 0.743 | 24 (60%) | 76 (48%) | 0.180 |
| Involves eloquent location | 36 (71%) | 103 (58%) | 0.111 | 8 (80%) | 21 (100%) | 0.037 | 28 (68%) | 82 (53%) | 0.072 |
| Is diffuse nidus | 10 (20%) | 21 (12%) | 0.135 | 10 (100%) | 21 (100%) | 0 (0%) | 0 (0%) | ||
| mRS (1-week post-op) | 1.6 (1.4) | 1.9 (1.4) | 0.141 | 1.9 (1.4) | 2.1 (1.4) | 0.581 | 1.6 (1.3) | 1.9 (1.4) | 0.218 |
| Sex (Male) | 31 (61%) | 93 (51%) | 0.235 | 5 (50%) | 11 (52%) | 0.903 | 26 (63%) | 82 (51%) | 0.164 |
| Seizures at presentation | 16 (31%) | 43 (24%) | 0.262 | 5 (50%) | 6 (29%) | 0.252 | 11 (27%) | 37 (23%) | 0.606 |
| Deficit at presentation | 22 (43%) | 63 (35%) | 0.265 | 8 (80%) | 10 (48%) | 0.093 | 14 (34%) | 53 (33%) | 0.882 |
| Hemorrhage at presentation | 28 (55%) | 114 (63%) | 0.318 | 6 (60%) | 10 (48%) | 0.526 | 22 (54%) | 104 (65%) | 0.198 |
| mRS (presentation) | 2.4 (1.8) | 2.0 (1.6) | 0.359 | 1.9 (1.6) | 2.2 (1.5) | 0.485 | 2.4 (1.8) | 2.0 (1.7) | 0.231 |
| mRS (final) | 1.0 (1.3) | 1.0 (1.2) | 0.474 | 2.1 (1.8) | 1.3 (1.1) | 0.306 | 0.9 (1.1) | 1.0 (1.2) | 0.427 |
| Had surgery | 32 (63%) | 105 (59%) | 0.601 | 9 (90%) | 6 (29%) | 0.002 | 23 (56%) | 99 (63%) | 0.443 |
| Paresis at presentation | 14 (27%) | 44 (24%) | 0.633 | 6 (60%) | 6 (29%) | 0.099 | 8 (20%) | 38 (24%) | 0.578 |
| mRS (pre-treatment) | 1.6 (1.5) | 1.6 (1.3) | 0.644 | 1.5 (1.4) | 2.3 (1.5) | 0.104 | 1.6 (1.5) | 1.4 (1.2) | 0.783 |
| Presence of aneurysms | 10 (21%) | 35 (20%) | 0.844 | 2 (20%) | 8 (38%) | 0.322 | 8 (21%) | 27 (17%) | 0.568 |
Associations
Correlation matrix.
# Create new dataframe
df_ <-
df_uni %>%
select(all_of(c(
EXPOSURES_CONTINUOUS,
EXPOSURES_BINARY,
EXPOSURES_CATEGORICAL,
OUTCOMES
)))
# Convert the outcome variable to numeric for correlation calculation
df_ <-
df_ %>%
select(where(is.numeric), where(is.logical)) %>%
mutate(across(everything(), as.numeric))
# Compute the correlation matrix
cor_matrix <- cor(df_, use = "complete.obs", method = "spearman")
# Plot the correlation matrix
ggcorrplot::ggcorrplot(
cor_matrix,
method = "circle",
lab = TRUE,
lab_size = 2,
colors = c("red", "white", "green4"),
title = "Correlation Matrix",
hc.order = TRUE
) +
theme(
axis.text.x = element_text(size = 8), # Adjust x-axis text size
axis.text.y = element_text(size = 8) # Adjust y-axis text size
)
Univariable statistics
Setup
Define function.
fit_model <- function(
df = df_uni, cols, is_sandwich = TRUE) {
# Initialize
out <- list()
for (i in seq_along(cols)) {
# Create formula
model <- as.formula(paste(CHOSEN_OUTCOME, "~", cols[i]))
fit <- suppressWarnings(glm(
model,
data = df,
family = binomial()
))
if (is_sandwich) {
# Calculate robust standard errors (sandwich)
robust_se <- sandwich::vcovHC(fit, type = "HC0")
# Compute robust confidence intervals
fit_results <- lmtest::coeftest(fit, vcov. = robust_se)
} else {
fit_results <- fit
}
# Summarize model coefficients
fit_summary <-
fit_results %>%
# broom does not support exponentiation after coeftest so do it manually
broom::tidy(conf.int = T) %>%
mutate(across(c(estimate, "conf.low", "conf.high"), exp)) %>%
arrange(term == "(Intercept)", p.value) %>%
rename(odds_ratio = estimate, z_value = statistic)
# Stylize and print
stylized <-
fit_summary %>%
rename(
"Predictors" = term,
"Odds Ratios (OR)" = odds_ratio,
"SE" = std.error,
"Z-scores" = z_value,
"P-values" = p.value,
"CI (low)" = conf.low,
"CI (high)" = conf.high,
) %>%
mutate(
`Odds Ratios (OR)` = round(`Odds Ratios (OR)`, 2),
`SE` = round(`SE`, 2),
`CI (low)` = round(`CI (low)`, 2),
`CI (high)` = round(`CI (high)`, 2),
`P-values` = round(`P-values`, 3),
)
out[[i]] <- stylized
}
return(out)
}
Unadjusted - Original
Fit logistic.
# Define cols
cols <- unname(c(
EXPOSURES_CONTINUOUS,
EXPOSURES_BINARY,
EXPOSURES_CATEGORICAL
))
# Fit model
out <- fit_model(df = df_uni, cols = cols, is_sandwich = TRUE)
# Create table
univariable_unadjusted <-
out %>%
bind_rows() %>%
filter(Predictors != "(Intercept)") %>%
filter(`CI (high)` < 50) %>%
arrange(`P-values`)
# Print
univariable_unadjusted %>%
sable()
| Predictors | Odds Ratios (OR) | SE | Z-scores | P-values | CI (low) | CI (high) |
|---|---|---|---|---|---|---|
| locationOther | 0.00 | 0.72 | -21.402 | 0.000 | 0.00 | 0.00 |
| locationHemispheric | 0.21 | 0.49 | -3.167 | 0.002 | 0.08 | 0.55 |
| lawton_young_grade | 1.35 | 0.10 | 2.900 | 0.004 | 1.10 | 1.65 |
| spetzler_martin_grade | 1.47 | 0.14 | 2.821 | 0.005 | 1.13 | 1.93 |
| is_spetzler_martin_grade_less_than_4TRUE | 0.39 | 0.33 | -2.793 | 0.005 | 0.20 | 0.76 |
| size_score | 1.73 | 0.21 | 2.559 | 0.011 | 1.14 | 2.63 |
| procedure_combinationsR | 0.11 | 1.01 | -2.222 | 0.026 | 0.01 | 0.77 |
| procedure_combinationsS | 0.13 | 0.98 | -2.049 | 0.040 | 0.02 | 0.92 |
| max_size_cm | 1.14 | 0.06 | 2.010 | 0.044 | 1.00 | 1.29 |
| procedure_combinationsRS | 0.13 | 1.11 | -1.876 | 0.061 | 0.01 | 1.10 |
| age_at_first_treatment_yrs | 1.09 | 0.04 | 1.864 | 0.062 | 1.00 | 1.19 |
| procedure_combinationsES | 0.16 | 1.01 | -1.817 | 0.069 | 0.02 | 1.15 |
| has_deep_venous_drainageTRUE | 1.78 | 0.33 | 1.721 | 0.085 | 0.92 | 3.42 |
| procedure_combinationsER | 0.19 | 0.98 | -1.691 | 0.091 | 0.03 | 1.30 |
| locationDeep | 0.42 | 0.51 | -1.669 | 0.095 | 0.16 | 1.16 |
| is_eloquent_locationTRUE | 1.71 | 0.34 | 1.561 | 0.119 | 0.87 | 3.34 |
| is_diffuse_nidusTRUE | 1.87 | 0.42 | 1.483 | 0.138 | 0.82 | 4.28 |
| venous_drainageSuperficial | 0.53 | 0.44 | -1.453 | 0.146 | 0.22 | 1.25 |
| modified_rankin_score_postop_within_1_week | 1.17 | 0.11 | 1.420 | 0.156 | 0.94 | 1.45 |
| modified_rankin_score_presentation | 0.89 | 0.09 | -1.252 | 0.211 | 0.75 | 1.07 |
| is_maleTRUE | 1.48 | 0.32 | 1.221 | 0.222 | 0.79 | 2.79 |
| has_seizuresTRUE | 1.49 | 0.35 | 1.141 | 0.254 | 0.75 | 2.95 |
| has_deficitTRUE | 1.41 | 0.32 | 1.067 | 0.286 | 0.75 | 2.65 |
| has_hemorrhageTRUE | 0.74 | 0.32 | -0.954 | 0.340 | 0.39 | 1.38 |
| locationCorpus Callosum | 0.60 | 0.83 | -0.618 | 0.537 | 0.12 | 3.03 |
| is_surgeryTRUE | 1.20 | 0.33 | 0.566 | 0.572 | 0.63 | 2.28 |
| procedure_combinationsERS | 0.58 | 0.99 | -0.555 | 0.579 | 0.08 | 4.01 |
| has_paresisTRUE | 1.16 | 0.36 | 0.416 | 0.678 | 0.58 | 2.34 |
| modified_rankin_score_final | 1.03 | 0.12 | 0.270 | 0.787 | 0.82 | 1.30 |
| modified_rankin_score_pretreatment | 0.99 | 0.10 | -0.146 | 0.884 | 0.81 | 1.20 |
| has_associated_aneurysmTRUE | 1.05 | 0.40 | 0.128 | 0.898 | 0.48 | 2.31 |
| venous_drainageDeep | 0.95 | 0.43 | -0.113 | 0.910 | 0.41 | 2.22 |
Adjusted - Original
Fit logistic.
cols <- unname(c(
EXPOSURES_CONTINUOUS,
EXPOSURES_BINARY,
EXPOSURES_CATEGORICAL
))
# Initialize
out <- list()
df <- df_uni
for (i in seq_along(cols)) {
# Create formula
model <- as.formula(paste(
CHOSEN_OUTCOME,
"~",
"age_at_first_treatment_yrs + is_male +",
cols[i]
))
fit <- suppressWarnings(glm(
model,
data = df,
family = binomial()
))
# Calculate robust standard errors (sandwich)
robust_se <- sandwich::vcovHC(fit, type = "HC0")
# Compute robust confidence intervals
robust_ci <- lmtest::coeftest(fit, vcov. = robust_se)
# Summarize model coefficients
fit_summary <-
robust_ci %>%
# broom does not support exponentiation after coeftest so do it manually
broom::tidy(conf.int = T) %>%
mutate(across(c(estimate, "conf.low", "conf.high"), exp)) %>%
arrange(term == "(Intercept)", p.value) %>%
rename(odds_ratio = estimate, z_value = statistic)
# Stylize and print
stylized <-
fit_summary %>%
rename(
"Predictors" = term,
"Odds Ratios (OR)" = odds_ratio,
"SE" = std.error,
"Z-scores" = z_value,
"P-values" = p.value,
"CI (low)" = conf.low,
"CI (high)" = conf.high,
) %>%
mutate(
`Odds Ratios (OR)` = round(`Odds Ratios (OR)`, 2),
`SE` = round(`SE`, 2),
`CI (low)` = round(`CI (low)`, 2),
`CI (high)` = round(`CI (high)`, 2),
`P-values` = round(`P-values`, 3),
)
out[[i]] <- stylized
}
Print results.
# Create table
univariable_adjusted <-
out %>%
bind_rows() %>%
filter(Predictors != "(Intercept)") %>%
filter(!str_starts(Predictors, "age_at_first_treatment_yrs|is_male")) %>%
filter(`CI (high)` < 50) %>%
arrange(`P-values`)
# Print
univariable_adjusted %>%
sable()
| Predictors | Odds Ratios (OR) | SE | Z-scores | P-values | CI (low) | CI (high) |
|---|---|---|---|---|---|---|
| locationOther | 0.00 | 0.86 | -18.116 | 0.000 | 0.00 | 0.00 |
| locationHemispheric | 0.20 | 0.52 | -3.145 | 0.002 | 0.07 | 0.54 |
| is_spetzler_martin_grade_less_than_4TRUE | 0.38 | 0.34 | -2.840 | 0.005 | 0.20 | 0.74 |
| spetzler_martin_grade | 1.46 | 0.14 | 2.750 | 0.006 | 1.12 | 1.92 |
| lawton_young_grade | 1.34 | 0.11 | 2.760 | 0.006 | 1.09 | 1.65 |
| size_score | 1.69 | 0.22 | 2.434 | 0.015 | 1.11 | 2.58 |
| procedure_combinationsR | 0.12 | 1.00 | -2.119 | 0.034 | 0.02 | 0.85 |
| max_size_cm | 1.14 | 0.06 | 2.058 | 0.040 | 1.01 | 1.30 |
| procedure_combinationsS | 0.15 | 0.97 | -1.972 | 0.049 | 0.02 | 0.99 |
| procedure_combinationsES | 0.17 | 1.00 | -1.770 | 0.077 | 0.02 | 1.21 |
| procedure_combinationsRS | 0.14 | 1.13 | -1.719 | 0.086 | 0.02 | 1.31 |
| is_eloquent_locationTRUE | 1.77 | 0.34 | 1.668 | 0.095 | 0.90 | 3.48 |
| procedure_combinationsER | 0.20 | 0.98 | -1.652 | 0.098 | 0.03 | 1.35 |
| is_diffuse_nidusTRUE | 2.07 | 0.45 | 1.628 | 0.103 | 0.86 | 4.97 |
| modified_rankin_score_postop_within_1_week | 1.19 | 0.11 | 1.619 | 0.105 | 0.96 | 1.48 |
| has_deep_venous_drainageTRUE | 1.70 | 0.34 | 1.583 | 0.113 | 0.88 | 3.29 |
| locationDeep | 0.43 | 0.54 | -1.574 | 0.116 | 0.15 | 1.23 |
| venous_drainageSuperficial | 0.53 | 0.45 | -1.436 | 0.151 | 0.22 | 1.26 |
| has_deficitTRUE | 1.52 | 0.33 | 1.267 | 0.205 | 0.80 | 2.90 |
| has_seizuresTRUE | 1.47 | 0.35 | 1.089 | 0.276 | 0.74 | 2.93 |
| modified_rankin_score_presentation | 0.92 | 0.09 | -0.957 | 0.339 | 0.78 | 1.09 |
| has_hemorrhageTRUE | 0.78 | 0.32 | -0.780 | 0.436 | 0.41 | 1.46 |
| has_paresisTRUE | 1.32 | 0.37 | 0.750 | 0.453 | 0.64 | 2.73 |
| is_surgeryTRUE | 1.21 | 0.33 | 0.568 | 0.570 | 0.63 | 2.30 |
| locationCorpus Callosum | 0.63 | 0.83 | -0.554 | 0.580 | 0.12 | 3.21 |
| modified_rankin_score_final | 1.06 | 0.11 | 0.550 | 0.582 | 0.85 | 1.33 |
| procedure_combinationsERS | 0.62 | 0.98 | -0.493 | 0.622 | 0.09 | 4.20 |
| venous_drainageDeep | 0.90 | 0.44 | -0.231 | 0.818 | 0.38 | 2.14 |
| modified_rankin_score_pretreatment | 1.02 | 0.10 | 0.193 | 0.847 | 0.84 | 1.24 |
| has_associated_aneurysmTRUE | 1.06 | 0.40 | 0.144 | 0.885 | 0.48 | 2.34 |
Unadjusted - Recoded
Fit logistic.
# Define columns
cols <- unname(c(
EXPOSURES_CONTINUOUS,
EXPOSURES_BINARY,
EXPOSURES_CATEGORICAL
))
# Initialize
out <- list()
df <- df_multi
for (i in seq_along(cols)) {
# Create formula
model <- as.formula(paste(CHOSEN_OUTCOME, "~", cols[i]))
fit <- suppressWarnings(glm(
model,
data = df,
family = binomial()
))
# Calculate robust standard errors (sandwich)
robust_se <- sandwich::vcovHC(fit, type = "HC0")
# Compute robust confidence intervals
robust_ci <- lmtest::coeftest(fit, vcov. = robust_se)
# Summarize model coefficients
fit_summary <-
robust_ci %>%
# broom does not support exponentiation after coeftest so do it manually
broom::tidy(conf.int = T) %>%
mutate(across(c(estimate, "conf.low", "conf.high"), exp)) %>%
arrange(term == "(Intercept)", p.value) %>%
rename(odds_ratio = estimate, z_value = statistic)
# Stylize and print
stylized <-
fit_summary %>%
rename(
"Predictors" = term,
"Odds Ratios (OR)" = odds_ratio,
"SE" = std.error,
"Z-scores" = z_value,
"P-values" = p.value,
"CI (low)" = conf.low,
"CI (high)" = conf.high,
) %>%
mutate(
`Odds Ratios (OR)` = round(`Odds Ratios (OR)`, 2),
`SE` = round(`SE`, 2),
`CI (low)` = round(`CI (low)`, 2),
`CI (high)` = round(`CI (high)`, 2),
`P-values` = round(`P-values`, 3),
)
out[[i]] <- stylized
}
Print results.
# Create table
univariable_unadjusted_recoded <-
out %>%
bind_rows() %>%
filter(Predictors != "(Intercept)") %>%
filter(`CI (high)` < 50) %>%
arrange(`P-values`)
# Print
univariable_unadjusted_recoded %>%
sable()
| Predictors | Odds Ratios (OR) | SE | Z-scores | P-values | CI (low) | CI (high) |
|---|---|---|---|---|---|---|
| locationHemispheric | 0.28 | 0.43 | -2.945 | 0.003 | 0.12 | 0.66 |
| lawton_young_grade | 1.35 | 0.10 | 2.900 | 0.004 | 1.10 | 1.65 |
| spetzler_martin_grade | 1.47 | 0.14 | 2.821 | 0.005 | 1.13 | 1.93 |
| is_spetzler_martin_grade_less_than_4TRUE | 0.39 | 0.33 | -2.793 | 0.005 | 0.20 | 0.76 |
| size_score | 1.73 | 0.21 | 2.559 | 0.011 | 1.14 | 2.63 |
| procedure_combinationsR | 0.11 | 1.01 | -2.222 | 0.026 | 0.01 | 0.77 |
| procedure_combinationsS | 0.13 | 0.98 | -2.049 | 0.040 | 0.02 | 0.92 |
| max_size_cm | 1.14 | 0.06 | 2.010 | 0.044 | 1.00 | 1.29 |
| age_at_first_treatment_yrs | 1.09 | 0.04 | 1.864 | 0.062 | 1.00 | 1.19 |
| venous_drainageSuperficial | 0.54 | 0.33 | -1.823 | 0.068 | 0.28 | 1.05 |
| has_deep_venous_drainageTRUE | 1.78 | 0.33 | 1.721 | 0.085 | 0.92 | 3.42 |
| procedure_combinationsMultimodal | 0.23 | 0.93 | -1.565 | 0.118 | 0.04 | 1.45 |
| is_eloquent_locationTRUE | 1.71 | 0.34 | 1.561 | 0.119 | 0.87 | 3.34 |
| is_diffuse_nidusTRUE | 1.87 | 0.42 | 1.483 | 0.138 | 0.82 | 4.28 |
| modified_rankin_score_postop_within_1_week | 1.17 | 0.11 | 1.420 | 0.156 | 0.94 | 1.45 |
| modified_rankin_score_presentation | 0.89 | 0.09 | -1.252 | 0.211 | 0.75 | 1.07 |
| locationDeep | 0.57 | 0.45 | -1.236 | 0.217 | 0.24 | 1.39 |
| is_maleTRUE | 1.48 | 0.32 | 1.221 | 0.222 | 0.79 | 2.79 |
| has_seizuresTRUE | 1.49 | 0.35 | 1.141 | 0.254 | 0.75 | 2.95 |
| is_surgeryTRUE | 0.64 | 0.41 | -1.089 | 0.276 | 0.29 | 1.42 |
| has_deficitTRUE | 1.41 | 0.32 | 1.067 | 0.286 | 0.75 | 2.65 |
| has_hemorrhageTRUE | 0.74 | 0.32 | -0.954 | 0.340 | 0.39 | 1.38 |
| has_paresisTRUE | 1.16 | 0.36 | 0.416 | 0.678 | 0.58 | 2.34 |
| modified_rankin_score_final | 1.03 | 0.12 | 0.270 | 0.787 | 0.82 | 1.30 |
| modified_rankin_score_pretreatment | 0.99 | 0.10 | -0.146 | 0.884 | 0.81 | 1.20 |
| has_associated_aneurysmTRUE | 1.05 | 0.40 | 0.128 | 0.898 | 0.48 | 2.31 |
Adjusted - Recoded
Fit logistic.
# Define columns
cols <- unname(c(
EXPOSURES_CONTINUOUS,
EXPOSURES_BINARY,
EXPOSURES_CATEGORICAL
))
# Initialize
out <- list()
df <- df_multi
for (i in seq_along(cols)) {
# Create formula
model <- as.formula(paste(
CHOSEN_OUTCOME,
"~",
"age_at_first_treatment_yrs + is_male +",
cols[i]
))
fit <- suppressWarnings(glm(
model,
data = df,
family = binomial()
))
# Calculate robust standard errors (sandwich)
robust_se <- sandwich::vcovHC(fit, type = "HC0")
# Compute robust confidence intervals
robust_ci <- lmtest::coeftest(fit, vcov. = robust_se)
# Summarize model coefficients
fit_summary <-
robust_ci %>%
# broom does not support exponentiation after coeftest so do it manually
broom::tidy(conf.int = T) %>%
mutate(across(c(estimate, "conf.low", "conf.high"), exp)) %>%
arrange(term == "(Intercept)", p.value) %>%
rename(odds_ratio = estimate, z_value = statistic)
# Stylize and print
stylized <-
fit_summary %>%
rename(
"Predictors" = term,
"Odds Ratios (OR)" = odds_ratio,
"SE" = std.error,
"Z-scores" = z_value,
"P-values" = p.value,
"CI (low)" = conf.low,
"CI (high)" = conf.high,
) %>%
mutate(
`Odds Ratios (OR)` = round(`Odds Ratios (OR)`, 2),
`SE` = round(`SE`, 2),
`CI (low)` = round(`CI (low)`, 2),
`CI (high)` = round(`CI (high)`, 2),
`P-values` = round(`P-values`, 3),
)
out[[i]] <- stylized
}
Print results.
# Create table
univariable_adjusted_recoded <-
out %>%
bind_rows() %>%
filter(Predictors != "(Intercept)") %>%
filter(!str_starts(Predictors, "age_at_first_treatment_yrs|is_male")) %>%
filter(`CI (high)` < 50) %>%
arrange(`P-values`)
# Print
univariable_adjusted_recoded %>%
sable()
| Predictors | Odds Ratios (OR) | SE | Z-scores | P-values | CI (low) | CI (high) |
|---|---|---|---|---|---|---|
| locationHemispheric | 0.25 | 0.44 | -3.100 | 0.002 | 0.11 | 0.60 |
| is_spetzler_martin_grade_less_than_4TRUE | 0.38 | 0.34 | -2.840 | 0.005 | 0.20 | 0.74 |
| spetzler_martin_grade | 1.46 | 0.14 | 2.750 | 0.006 | 1.12 | 1.92 |
| lawton_young_grade | 1.34 | 0.11 | 2.760 | 0.006 | 1.09 | 1.65 |
| size_score | 1.69 | 0.22 | 2.434 | 0.015 | 1.11 | 2.58 |
| procedure_combinationsR | 0.12 | 1.00 | -2.121 | 0.034 | 0.02 | 0.85 |
| max_size_cm | 1.14 | 0.06 | 2.058 | 0.040 | 1.01 | 1.30 |
| procedure_combinationsS | 0.15 | 0.97 | -1.976 | 0.048 | 0.02 | 0.98 |
| venous_drainageSuperficial | 0.56 | 0.34 | -1.702 | 0.089 | 0.29 | 1.09 |
| is_eloquent_locationTRUE | 1.77 | 0.34 | 1.668 | 0.095 | 0.90 | 3.48 |
| is_diffuse_nidusTRUE | 2.07 | 0.45 | 1.628 | 0.103 | 0.86 | 4.97 |
| modified_rankin_score_postop_within_1_week | 1.19 | 0.11 | 1.619 | 0.105 | 0.96 | 1.48 |
| has_deep_venous_drainageTRUE | 1.70 | 0.34 | 1.583 | 0.113 | 0.88 | 3.29 |
| procedure_combinationsMultimodal | 0.25 | 0.92 | -1.503 | 0.133 | 0.04 | 1.53 |
| locationDeep | 0.55 | 0.47 | -1.277 | 0.202 | 0.22 | 1.38 |
| has_deficitTRUE | 1.52 | 0.33 | 1.267 | 0.205 | 0.80 | 2.90 |
| has_seizuresTRUE | 1.47 | 0.35 | 1.089 | 0.276 | 0.74 | 2.93 |
| is_surgeryTRUE | 0.66 | 0.41 | -1.025 | 0.306 | 0.29 | 1.47 |
| modified_rankin_score_presentation | 0.92 | 0.09 | -0.957 | 0.339 | 0.78 | 1.09 |
| has_hemorrhageTRUE | 0.78 | 0.32 | -0.780 | 0.436 | 0.41 | 1.46 |
| has_paresisTRUE | 1.32 | 0.37 | 0.750 | 0.453 | 0.64 | 2.73 |
| modified_rankin_score_final | 1.06 | 0.11 | 0.550 | 0.582 | 0.85 | 1.33 |
| modified_rankin_score_pretreatment | 1.02 | 0.10 | 0.193 | 0.847 | 0.84 | 1.24 |
| has_associated_aneurysmTRUE | 1.06 | 0.40 | 0.144 | 0.885 | 0.48 | 2.34 |
Interaction analysis
Adjusted - Recoded
Fit logistic.
# Define columns
cols <- unname(c(
EXPOSURES_CONTINUOUS,
EXPOSURES_BINARY,
EXPOSURES_CATEGORICAL
))
# Remove unwanted columns
unwanted <- c("modified_rankin_score_final")
cols <- cols[!cols %in% unwanted]
adjustors <- c(
"age_at_first_treatment_yrs",
"is_male"
)
# Initialize
out <- list()
df <- df_multi
k <- 1
for (i in seq_along(cols)) {
# Do not fit model for variables that we are adjusting for
if (cols[i] %in% adjustors) {
next
}
# Create formula
model <- as.formula(paste(
CHOSEN_OUTCOME,
"~",
"age_at_first_treatment_yrs + is_male +",
cols[i],
"* has_hemorrhage"
))
fit <- suppressWarnings(glm(
model,
data = df,
family = binomial()
))
# Calculate robust standard errors (sandwich)
robust_se <- sandwich::vcovHC(fit, type = "HC0")
# Compute robust confidence intervals
robust_ci <- lmtest::coeftest(fit, vcov. = robust_se)
# Summarize model coefficients
fit_summary <-
robust_ci %>%
# broom does not support exponentiation after coeftest so do it manually
broom::tidy(conf.int = T) %>%
mutate(across(c(estimate, "conf.low", "conf.high"), exp)) %>%
arrange(term == "(Intercept)", p.value) %>%
rename(odds_ratio = estimate, z_value = statistic)
# Stylize and print
stylized <-
fit_summary %>%
rename(
"Predictors" = term,
"Odds Ratios (OR)" = odds_ratio,
"SE" = std.error,
"Z-scores" = z_value,
"P-values" = p.value,
"CI (low)" = conf.low,
"CI (high)" = conf.high,
) %>%
mutate(
`Odds Ratios (OR)` = round(`Odds Ratios (OR)`, 2),
`SE` = round(`SE`, 2),
`CI (low)` = round(`CI (low)`, 2),
`CI (high)` = round(`CI (high)`, 2),
`P-values` = round(`P-values`, 3),
)
out[[k]] <- stylized
k <- k + 1
}
Print all results.
# Define values
unwanted <- c(
"is_maleTRUE",
"age_at_first_treatment_yrs",
"(Intercept)",
"has_hemorrhageTRUE"
)
# Initialize objects
isolated_values <- list()
# Get main and interaction effects
for (i in seq_along(out)) {
# Get data
result <- out[[i]]
# Get all main effects of interest
main_effects <-
result %>%
filter(!Predictors %in% unwanted) %>%
filter(!str_detect(Predictors, ":")) %>%
select(-"SE", -"Z-scores")
# Get the interaction effects
interaction_effects <-
result %>%
filter(!Predictors %in% unwanted) %>%
filter(str_detect(Predictors, ":")) %>%
mutate(Predictors = str_extract(Predictors, "^[^:]+")) %>%
select(-"SE", -"Z-scores") %>%
rename_with(~ paste("Interaction -", .), -Predictors)
isolated_values[[i]] <-
main_effects %>%
left_join(interaction_effects, by = "Predictors")
}
Create and print table.
# Create table
univariable_adjusted_recoded_interactions <-
isolated_values %>%
bind_rows() %>%
arrange(`Interaction - P-values`)
# Print
univariable_adjusted_recoded_interactions %>%
sable()
| Predictors | Odds Ratios (OR) | P-values | CI (low) | CI (high) | Interaction - Odds Ratios (OR) | Interaction - P-values | Interaction - CI (low) | Interaction - CI (high) |
|---|---|---|---|---|---|---|---|---|
| procedure_combinationsMultimodal | 0.00 | 0.000 | 0.00 | 0.00 | 1.80e+06 | 0.000 | 1.02e+05 | 3.18e+07 |
| procedure_combinationsR | 0.00 | 0.000 | 0.00 | 0.00 | 1.21e+06 | 0.000 | 4.57e+04 | 3.20e+07 |
| procedure_combinationsS | 0.00 | 0.000 | 0.00 | 0.00 | 1.19e+06 | 0.000 | 4.58e+04 | 3.10e+07 |
| modified_rankin_score_pretreatment | 1.93 | 0.016 | 1.13 | 3.30 | 4.70e-01 | 0.015 | 2.60e-01 | 8.60e-01 |
| modified_rankin_score_presentation | 1.52 | 0.127 | 0.89 | 2.59 | 5.50e-01 | 0.047 | 3.00e-01 | 9.90e-01 |
| modified_rankin_score_postop_within_1_week | 1.69 | 0.007 | 1.15 | 2.47 | 6.20e-01 | 0.050 | 3.80e-01 | 1.00e+00 |
| size_score | 2.61 | 0.012 | 1.23 | 5.55 | 4.70e-01 | 0.118 | 1.80e-01 | 1.21e+00 |
| has_deficitTRUE | 2.90 | 0.039 | 1.06 | 7.97 | 3.40e-01 | 0.121 | 9.00e-02 | 1.32e+00 |
| is_spetzler_martin_grade_less_than_4TRUE | 0.21 | 0.003 | 0.07 | 0.59 | 3.12e+00 | 0.127 | 7.20e-01 | 1.34e+01 |
| spetzler_martin_grade | 1.82 | 0.018 | 1.11 | 2.97 | 6.90e-01 | 0.224 | 3.80e-01 | 1.25e+00 |
| max_size_cm | 1.25 | 0.081 | 0.97 | 1.60 | 8.30e-01 | 0.236 | 6.20e-01 | 1.13e+00 |
| is_eloquent_locationTRUE | 2.83 | 0.078 | 0.89 | 9.04 | 4.70e-01 | 0.301 | 1.10e-01 | 1.98e+00 |
| lawton_young_grade | 1.59 | 0.015 | 1.09 | 2.31 | 7.80e-01 | 0.309 | 4.90e-01 | 1.25e+00 |
| has_associated_aneurysmTRUE | 0.59 | 0.517 | 0.12 | 2.89 | 2.38e+00 | 0.356 | 3.80e-01 | 1.51e+01 |
| is_diffuse_nidusTRUE | 1.31 | 0.692 | 0.35 | 4.89 | 2.22e+00 | 0.375 | 3.80e-01 | 1.30e+01 |
| has_deep_venous_drainageTRUE | 1.37 | 0.526 | 0.52 | 3.62 | 1.61e+00 | 0.489 | 4.20e-01 | 6.20e+00 |
| locationHemispheric | 0.33 | 0.176 | 0.07 | 1.63 | 5.40e-01 | 0.524 | 8.00e-02 | 3.67e+00 |
| venous_drainageSuperficial | 0.67 | 0.426 | 0.26 | 1.78 | 6.60e-01 | 0.550 | 1.70e-01 | 2.55e+00 |
| has_seizuresTRUE | 1.80 | 0.255 | 0.66 | 4.92 | 6.60e-01 | 0.557 | 1.60e-01 | 2.68e+00 |
| is_surgeryTRUE | 0.94 | 0.946 | 0.18 | 4.94 | 6.60e-01 | 0.668 | 1.00e-01 | 4.40e+00 |
| locationDeep | 0.41 | 0.306 | 0.08 | 2.24 | 1.55e+00 | 0.673 | 2.00e-01 | 1.18e+01 |
| has_paresisTRUE | 1.22 | 0.763 | 0.33 | 4.49 | 1.20e+00 | 0.826 | 2.40e-01 | 5.90e+00 |
Eloquence.
df %>%
count(has_hemorrhage, is_eloquent_location, !!sym(CHOSEN_OUTCOME)) %>%
drop_na() %>%
arrange(has_hemorrhage, is_eloquent_location) %>%
group_by(has_hemorrhage, is_eloquent_location) %>%
mutate(
num_total = sum(n),
pct_total = scales::percent(n / num_total, 1)
) %>%
sable()
| has_hemorrhage | is_eloquent_location | has_complication_minor | n | num_total | pct_total |
|---|---|---|---|---|---|
| FALSE | FALSE | FALSE | 24 | 28 | 86% |
| FALSE | FALSE | TRUE | 4 | 28 | 14% |
| FALSE | TRUE | FALSE | 45 | 64 | 70% |
| FALSE | TRUE | TRUE | 19 | 64 | 30% |
| TRUE | FALSE | FALSE | 50 | 61 | 82% |
| TRUE | FALSE | TRUE | 11 | 61 | 18% |
| TRUE | TRUE | FALSE | 59 | 76 | 78% |
| TRUE | TRUE | TRUE | 17 | 76 | 22% |
Deficit.
df %>%
count(has_hemorrhage, has_deficit, !!sym(CHOSEN_OUTCOME)) %>%
drop_na() %>%
arrange(has_hemorrhage, has_deficit) %>%
group_by(has_hemorrhage, has_deficit) %>%
mutate(
num_total = sum(n),
pct_total = scales::percent(n / num_total, 1)
) %>%
sable()
| has_hemorrhage | has_deficit | has_complication_minor | n | num_total | pct_total |
|---|---|---|---|---|---|
| FALSE | FALSE | FALSE | 46 | 58 | 79% |
| FALSE | FALSE | TRUE | 12 | 58 | 21% |
| FALSE | TRUE | FALSE | 23 | 34 | 68% |
| FALSE | TRUE | TRUE | 11 | 34 | 32% |
| TRUE | FALSE | FALSE | 73 | 90 | 81% |
| TRUE | FALSE | TRUE | 17 | 90 | 19% |
| TRUE | TRUE | FALSE | 41 | 52 | 79% |
| TRUE | TRUE | TRUE | 11 | 52 | 21% |
Seizures.
df %>%
count(has_hemorrhage, has_seizures, !!sym(CHOSEN_OUTCOME)) %>%
drop_na() %>%
arrange(has_hemorrhage, has_seizures) %>%
group_by(has_hemorrhage, has_seizures) %>%
mutate(
num_total = sum(n),
pct_total = scales::percent(n / num_total, 1)
) %>%
sable()
| has_hemorrhage | has_seizures | has_complication_minor | n | num_total | pct_total |
|---|---|---|---|---|---|
| FALSE | FALSE | FALSE | 51 | 65 | 78% |
| FALSE | FALSE | TRUE | 14 | 65 | 22% |
| FALSE | TRUE | FALSE | 18 | 27 | 67% |
| FALSE | TRUE | TRUE | 9 | 27 | 33% |
| TRUE | FALSE | FALSE | 89 | 110 | 81% |
| TRUE | FALSE | TRUE | 21 | 110 | 19% |
| TRUE | TRUE | FALSE | 25 | 32 | 78% |
| TRUE | TRUE | TRUE | 7 | 32 | 22% |
Selective inference
Read the following guides:
- How to use glmnet
- Selective inference using forward selection
Setup
Define unwanted columns.
unwanted_all <- c(
# Use values at or before first treatment
"modified_rankin_score_postop_within_1_week",
"modified_rankin_score_final"
)
unwanted_without_gradings <- c(
unwanted_all,
# Spetzler-Martin grade (size score + eloquent location + venous drainage)
"spetzler_martin_grade",
"is_spetzler_martin_grade_less_than_4",
"size_score", # Already covered by the more detailed max_size_cm
"venous_drainage", # Already covered by has_deep_venous_drainage
"location", # Already covered to some extent by eloquence
# Lawton-Young grade (SM + age + diffuse nidus + hemorrhage)
# (SM is heavily overlapping with LY, so include its components)
"lawton_young_grade",
# Use mRS pre-treatment (more predictive)
"modified_rankin_score_presentation" # "modified_rankin_score_pretreatment"
)
# Define cols of interest - use grading scores whenever these exist
unwanted_with_gradings <- c(
# Spetzler-Martin grade (size score + eloquent location + venous drainage)
"is_spetzler_martin_grade_less_than_4", # The grade has lower p-value
"size_score",
"max_size_cm",
"is_eloquent_location",
# "location", # Include this as not highly correlated with SM
"has_deep_venous_drainage",
"venous_drainage",
# Lawton-Young grade (SM + age + diffuse nidus + hemorrhage)
# (SM is heavily overlapping with LY, so include its components)
"lawton_young_grade",
# "is_diffuse_nidus",
# "has_hemorrhage",
# Use values at or before first treatment
"modified_rankin_score_final",
"modified_rankin_score_postop_within_1_week",
# Use mRS pre-treatment (more predictive)
"modified_rankin_score_presentation"
# "modified_rankin_score_pretreatment"
)
Create dataset.
# Define cols
cols <- unname(c(
EXPOSURES_CONTINUOUS,
EXPOSURES_BINARY,
EXPOSURES_CATEGORICAL,
CHOSEN_OUTCOME
))
# Define column sets for each analysis
cols_all <- cols[!cols %in% unwanted_all]
cols_with_gradings <- cols[!cols %in% unwanted_with_gradings]
cols_without_gradings <- cols[!cols %in% unwanted_without_gradings]
# Create df of interest
df_all <-
# Use the non-recoded dataset
df_uni %>%
dplyr::select(all_of(cols_all)) %>%
drop_na()
df_with_gradings <-
# Use the non-recoded dataset
df_uni %>%
dplyr::select(all_of(cols_with_gradings)) %>%
drop_na()
df_no_scores <-
# Use the non-recoded dataset
df_uni %>%
dplyr::select(all_of(cols_without_gradings)) %>%
drop_na()
# Create formula
frla <- as.formula(paste(CHOSEN_OUTCOME, "~ . - 1"))
# Create matrices with all variables
X_all <- model.matrix(frla, df_all)
y_all <- df_all %>%
pull(CHOSEN_OUTCOME) %>%
as.numeric()
# Create matrices with scores
X_with_gradings <- model.matrix(frla, df_with_gradings)
y_with_gradings <- df_with_gradings %>%
pull(CHOSEN_OUTCOME) %>%
as.numeric()
# Create matrices with score components
X_without_gradings <- model.matrix(frla, df_no_scores)
y_without_gradings <- df_no_scores %>%
pull(CHOSEN_OUTCOME) %>%
as.numeric()
# X should be centered for LASSO (necessary for glmnet)
# (scale = FALSE as this is done by the standardization step in glmnet)
X_all_scaled <- scale(X_all, center = T, scale = F)
X_with_gradings_scaled <- scale(X_with_gradings, center = T, scale = F)
X_without_gradings_scaled <- scale(X_without_gradings, center = T, scale = F)
# See the names of X to match column numbers to column names later on
colnames(X_without_gradings_scaled)
[1] "age_at_first_treatment_yrs" "modified_rankin_score_pretreatment"
[3] "max_size_cm" "is_maleFALSE"
[5] "is_maleTRUE" "is_eloquent_locationTRUE"
[7] "has_deep_venous_drainageTRUE" "is_diffuse_nidusTRUE"
[9] "has_hemorrhageTRUE" "has_seizuresTRUE"
[11] "has_deficitTRUE" "has_paresisTRUE"
[13] "has_associated_aneurysmTRUE" "is_surgeryTRUE"
[15] "procedure_combinationsER" "procedure_combinationsERS"
[17] "procedure_combinationsES" "procedure_combinationsR"
[19] "procedure_combinationsRS" "procedure_combinationsS"
Stepwise
Use forward stepwise regression.
# Set seed
set.seed(33)
# Run forward step-wise
fsfit <- fs(
x = X_without_gradings_scaled,
y = y_without_gradings,
maxsteps = 2000,
intercept = TRUE,
normalize = FALSE
)
# Estimate sigma
sigmahat <- estimateSigma(
x = X_without_gradings_scaled,
y = y_without_gradings,
intercept = TRUE,
standardize = FALSE
)$sigmahat
# Compute sequential p-values and confidence intervals - for all
# (sigma estimated from full model)
out.seq <- fsInf(fsfit, type = "active", sigma = sigmahat)
out.seq
Call:
fsInf(obj = fsfit, sigma = sigmahat, type = "active")
Standard deviation of noise (specified or estimated) sigma = 0.411
Sequential testing results with alpha = 0.100
Step Var Coef Z-score P-value LowConfPt UpConfPt LowTailArea UpTailArea
1 16 2.83e-01 3.407 0.006 1.11e-01 4.19e-01 0.049 0.050
2 1 1.10e-02 1.525 0.678 -9.40e-02 2.10e-02 0.050 0.049
3 5 7.50e-02 1.353 0.660 -1.80e+00 7.65e-01 0.050 0.050
4 4 9.40e+13 1.375 0.664 -2.20e+15 9.13e+14 0.050 0.050
5 10 7.70e-02 1.195 0.923 -Inf 5.42e-01 0.000 0.050
6 11 6.60e-02 1.147 0.044 9.60e-02 Inf 0.050 0.000
7 18 -7.50e-02 -1.039 0.481 -1.41e+00 1.39e+00 0.050 0.050
8 7 6.80e-02 1.134 0.387 -9.43e-01 1.46e+00 0.050 0.050
9 14 -5.50e-02 -0.729 0.355 -1.84e+00 9.91e-01 0.050 0.050
10 15 -4.17e-01 -2.121 0.063 -8.81e-01 3.80e-02 0.050 0.049
11 13 -4.80e-02 -0.671 0.726 -4.83e-01 1.67e+00 0.050 0.050
12 9 -2.80e-02 -0.457 0.963 3.65e-01 Inf 0.050 0.000
13 20 4.60e-02 0.542 0.027 6.98e-01 Inf 0.050 0.000
14 6 2.50e-02 0.383 0.689 -4.06e+00 1.60e+00 0.050 0.050
15 8 2.70e-02 0.305 0.692 -Inf 5.64e+00 0.000 0.050
16 3 3.00e-03 0.177 0.116 -2.45e-01 Inf 0.050 0.000
17 12 -7.00e-03 -0.071 0.974 1.67e+00 Inf 0.050 0.000
18 17 -9.00e-03 -0.073 0.046 -Inf -4.00e-01 0.000 0.050
19 19 -3.09e+13 -3.331 0.015 -Inf -3.24e+13 0.000 0.050
20 2 -1.00e-03 -0.030 0.566 -1.72e+00 2.31e+00 0.050 0.050
Estimated stopping point from ForwardStop rule = 1
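The `Var` column above reports numeric column indices into the design matrix. A small sketch (assuming the `out.seq` object fitted above, which exposes the selected indices as `out.seq$vars`) to map those indices back to predictor names:

```r
# Map the numeric Var indices reported by fsInf back to column names
tibble(
  step = seq_along(out.seq$vars),
  predictor = colnames(X_without_gradings_scaled)[out.seq$vars]
)
```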
# Compute p-values and confidence intervals after AIC stopping - for selected
out.aic <- fsInf(fsfit, type = "aic", sigma = sigmahat)
out.aic
Call:
fsInf(obj = fsfit, sigma = sigmahat, type = "aic")
Standard deviation of noise (specified or estimated) sigma = 0.411
Testing results at step = 2, with alpha = 0.100
Var Coef Z-score P-value LowConfPt UpConfPt LowTailArea UpTailArea
16 0.277 3.33 0.008 0.101 0.430 0.05 0.049
1 0.011 1.52 0.808 -0.184 0.016 0.05 0.050
Estimated stopping point from AIC rule = 2
# Compute p-values and confidence intervals after 5 fixed steps
out.fix <- fsInf(fsfit, type = "all", k = 5, sigma = sigmahat)
out.fix
Call:
fsInf(obj = fsfit, sigma = sigmahat, k = 5, type = "all")
Standard deviation of noise (specified or estimated) sigma = 0.411
Testing results at step = 5, with alpha = 0.100
Var Coef Z-score P-value LowConfPt UpConfPt LowTailArea UpTailArea
16 0.267 3.201 0.013 0.176 3.448 0.05 0.050
1 0.011 1.588 0.664 -0.091 0.021 0.05 0.049
5 0.100 1.659 0.967 -Inf -0.166 0.00 0.050
4 0.031 0.576 0.903 -Inf 0.809 0.00 0.050
10 0.078 1.222 0.924 -Inf 0.513 0.00 0.050
LASSO - all
Fit a LASSO logistic regression using all available variables.
# Set seed
set.seed(141845)
# Define values
X <- X_all
y <- y_all
# Perform cross-validation
# (alpha = 1 for lasso, alpha = 0 for ridge, 0 < alpha < 1 for elastic net)
cv_fit <- cv.glmnet(
x = X,
y = y,
family = "binomial",
alpha = 1,
nfolds = 10
)
# Extract the best lambda
best_lambda_min <- cv_fit$lambda.min
best_lambda_1se <- cv_fit$lambda.1se
# Run glmnet
gfit <- glmnet(
x = X,
y = y,
family = "binomial",
alpha = 1,
intercept = TRUE,
standardize = TRUE
)
# Extract coef for a given lambda - avoid the intercept term
# (given the small number of observations hence large SD, use min lambda)
lambda <- best_lambda_min
n <- length(y) # y is a numeric vector, so use length(), not nrow()
beta <- coef(
gfit,
s = lambda,
exact = TRUE,
x = X,
y = y
)
# Estimate sigma
gsigmahat <- estimateSigma(
x = X,
y = y,
intercept = TRUE,
standardize = TRUE
)$sigmahat
# Compute fixed lambda p-values and selection intervals
out <- fixedLassoInf(
x = X,
y = y,
beta = beta,
lambda = lambda,
# Level of significance
alpha = 0.1,
family = "binomial",
intercept = TRUE,
sigma = gsigmahat
)
# Prettify
tibble(
names = names(out$vars),
odds_ratio = exp(out$coef0),
pvalue = out$pv,
ci_lo = exp(out$ci[, 1]),
ci_hi = exp(out$ci[, 2])
) %>%
arrange(pvalue) %>%
sable()
| names | odds_ratio | pvalue | ci_lo | ci_hi |
|---|---|---|---|---|
| locationHemispheric | 0.306 | 0.005 | 0.169 | 0.612 |
| procedure_combinationsR | 0.356 | 0.140 | 0.164 | 1.803 |
| procedure_combinationsERS | 2.022 | 0.154 | 0.615 | 4.603 |
| age_at_first_treatment_yrs | 1.088 | 0.213 | 0.914 | 1.167 |
| is_spetzler_martin_grade_less_than_4TRUE | 0.545 | 0.215 | 0.297 | 1.962 |
| has_seizuresTRUE | 1.890 | 0.320 | 0.290 | 3.419 |
| is_maleFALSE | 0.666 | 0.636 | 0.440 | 22.995 |
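One caveat worth flagging, based on the `?fixedLassoInf` documentation: `glmnet` defines its penalty on the mean (1/n) scale, while `fixedLassoInf` expects lambda on the sum scale, so the cross-validated lambda may need rescaling by n before inference. A hedged sketch of the rescaled call (the `sigma` argument applies to the Gaussian case, so it is omitted here):

```r
# glmnet minimizes (1/n) * deviance + lambda * ||beta||_1, whereas
# fixedLassoInf assumes deviance + lambda * ||beta||_1; rescale by n.
n <- length(y)
beta <- coef(gfit, s = lambda, exact = TRUE, x = X, y = y)
out_rescaled <- fixedLassoInf(
  x = X,
  y = y,
  beta = beta,
  lambda = lambda * n, # sum-scale lambda
  alpha = 0.1,
  family = "binomial",
  intercept = TRUE
)
```

If the rescaled and unscaled calls select different active sets, the rescaled version is the one consistent with the package documentation.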
LASSO - without scores
Fit a LASSO logistic regression without the composite grading scores (using their components instead).
# Set seed
set.seed(141845)
# Define values
X <- X_without_gradings
y <- y_without_gradings
# Perform cross-validation
# (alpha = 1 for lasso, alpha = 0 for ridge, 0 < alpha < 1 for elastic net)
cv_fit <- cv.glmnet(
x = X,
y = y,
family = "binomial",
alpha = 1,
nfolds = 10
)
# Extract the best lambda
best_lambda_min <- cv_fit$lambda.min
best_lambda_1se <- cv_fit$lambda.1se
# Run glmnet
gfit <- glmnet(
x = X,
y = y,
family = "binomial",
alpha = 1,
intercept = TRUE,
standardize = TRUE
)
# Extract coef for a given lambda - avoid the intercept term
# (given the small number of observations hence large SD, use min lambda)
lambda <- best_lambda_min
n <- length(y) # y is a numeric vector, so use length(), not nrow()
beta <- coef(
gfit,
s = lambda,
exact = TRUE,
x = X,
y = y
)
# Estimate sigma
gsigmahat <- estimateSigma(
x = X,
y = y,
intercept = TRUE,
standardize = TRUE
)$sigmahat
# Compute fixed lambda p-values and selection intervals
out <- fixedLassoInf(
x = X,
y = y,
beta = beta,
lambda = lambda,
# Level of significance
alpha = 0.1,
family = "binomial",
intercept = TRUE,
sigma = gsigmahat
)
# Prettify
tibble(
names = names(out$vars),
odds_ratio = exp(out$coef0),
pvalue = out$pv,
ci_lo = exp(out$ci[, 1]),
ci_hi = exp(out$ci[, 2])
) %>%
arrange(pvalue) %>%
sable()
| names | odds_ratio | pvalue | ci_lo | ci_hi |
|---|---|---|---|---|
| procedure_combinationsERS | 4.09 | 0.018 | 1.42 | 8.45 |
LASSO - without components
Fit a LASSO logistic regression using the composite grading scores rather than their individual components.
# Set seed
set.seed(141845)
# Define values
X <- X_with_gradings
y <- y_with_gradings
# Perform cross-validation
# (alpha = 1 for lasso, alpha = 0 for ridge, 0 < alpha < 1 for elastic net)
cv_fit <- cv.glmnet(
x = X,
y = y,
family = "binomial",
alpha = 1,
nfolds = 10
)
# Extract the best lambda
best_lambda_min <- cv_fit$lambda.min
best_lambda_1se <- cv_fit$lambda.1se
# Run glmnet
gfit <- glmnet(
x = X,
y = y,
family = "binomial",
alpha = 1,
intercept = TRUE,
standardize = TRUE
)
# Extract coef for a given lambda - avoid the intercept term
# (given the small number of observations hence large SD, use min lambda)
lambda <- best_lambda_min
n <- length(y) # y is a numeric vector, so use length(), not nrow()
beta <- coef(
gfit,
s = lambda,
exact = TRUE,
x = X,
y = y
)
# Estimate sigma
gsigmahat <- estimateSigma(
x = X,
y = y,
intercept = TRUE,
standardize = TRUE
)$sigmahat
# Compute fixed lambda p-values and selection intervals
out <- fixedLassoInf(
x = X,
y = y,
beta = beta,
lambda = lambda,
# Level of significance
alpha = 0.1,
family = "binomial",
intercept = TRUE,
sigma = gsigmahat
)
# Prettify
tibble(
names = names(out$vars),
odds_ratio = exp(out$coef0),
pvalue = out$pv,
ci_lo = exp(out$ci[, 1]),
ci_hi = exp(out$ci[, 2])
) %>%
arrange(pvalue) %>%
sable()
| names | odds_ratio | pvalue | ci_lo | ci_hi |
|---|---|---|---|---|
| locationHemispheric | 0.312 | 0.007 | 0.150 | 0.651 |
| procedure_combinationsERS | 2.124 | 0.130 | 0.660 | 7.744 |
| procedure_combinationsR | 0.365 | 0.179 | 0.170 | 2.277 |
| age_at_first_treatment_yrs | 1.078 | 0.325 | 0.862 | 1.154 |
| has_seizuresTRUE | 1.898 | 0.355 | 0.233 | 3.383 |
| spetzler_martin_grade | 1.185 | 0.508 | 0.442 | 1.484 |
| is_maleFALSE | 0.660 | 0.700 | 0.460 | 66.877 |
Multivariable - Without scores
Setup
Feature selection
Create predictors.
# Define p-value threshold
pval_threshold <- 0.2
# Get the predictors of interest
predictors <-
univariable_adjusted_recoded %>%
bind_rows(univariable_unadjusted_recoded) %>%
filter(`P-values` < pval_threshold) %>%
pull(`Predictors`)
# Clean categorical variables
predictors <-
predictors %>%
str_replace("^([a-z0-9_]*)([A-Z].*)$", "\\1") %>%
unique()
# Remove unwanted predictors
predictors <- predictors[!predictors %in% unwanted_without_gradings]
# Print
print(predictors)
[1] "procedure_combinations" "max_size_cm"
[3] "is_eloquent_location" "is_diffuse_nidus"
[5] "has_deep_venous_drainage" "age_at_first_treatment_yrs"
Model selection
Fit logistic.
# Define data
df <- df_multi %>% drop_na(modified_rankin_score_pretreatment, location)
# Create formula
model <- as.formula(paste(
CHOSEN_OUTCOME,
"~",
paste(predictors, collapse = " + ")
))
fit <- glm(
model,
data = df,
family = binomial()
)
# Save
fit_without_grading <- fit
Print results.
# Summarize model coefficients
fit_summary <-
fit %>%
broom::tidy(exponentiate = T, conf.int = T) %>%
arrange(term == "(Intercept)", p.value) %>%
rename(odds_ratio = estimate, z_value = statistic)
# Stylize and print
stylized <-
fit_summary %>%
rename(
"Predictors" = term,
"Odds Ratios (OR)" = odds_ratio,
"SE" = std.error,
"Z-scores" = z_value,
"P-values" = p.value,
"CI (low)" = conf.low,
"CI (high)" = conf.high,
) %>%
mutate(
`Odds Ratios (OR)` = round(`Odds Ratios (OR)`, 2),
`SE` = round(`SE`, 2),
`CI (low)` = round(`CI (low)`, 2),
`CI (high)` = round(`CI (high)`, 2),
`P-values` = round(`P-values`, 3),
)
# Print
stylized %>% sable()
| Predictors | Odds Ratios (OR) | SE | Z-scores | P-values | CI (low) | CI (high) |
|---|---|---|---|---|---|---|
| procedure_combinationsR | 0.07 | 1.07 | -2.527 | 0.012 | 0.01 | 0.53 |
| procedure_combinationsMultimodal | 0.17 | 0.97 | -1.841 | 0.066 | 0.02 | 1.12 |
| procedure_combinationsS | 0.19 | 1.02 | -1.617 | 0.106 | 0.02 | 1.42 |
| age_at_first_treatment_yrs | 1.08 | 0.05 | 1.583 | 0.114 | 0.98 | 1.19 |
| has_deep_venous_drainageTRUE | 1.62 | 0.38 | 1.275 | 0.202 | 0.78 | 3.48 |
| max_size_cm | 1.07 | 0.09 | 0.795 | 0.426 | 0.90 | 1.27 |
| is_diffuse_nidusTRUE | 1.53 | 0.54 | 0.793 | 0.428 | 0.52 | 4.37 |
| is_eloquent_locationTRUE | 1.29 | 0.40 | 0.645 | 0.519 | 0.60 | 2.85 |
| (Intercept) | 0.34 | 1.14 | -0.950 | 0.342 | 0.03 | 3.59 |
Inference
Create robust CIs using the sandwich estimator.
# Calculate robust standard errors
robust_se <- sandwich::vcovHC(fit, type = "HC0")
# Compute robust confidence intervals
robust_ci <- lmtest::coeftest(fit, vcov. = robust_se)
Stylize results.
# Summarize model coefficients
fit_summary <-
robust_ci %>%
# broom does not support exponentiation after coeftest so do it manually
broom::tidy(exponentiate = F, conf.int = T) %>%
mutate(across(c(estimate, "conf.low", "conf.high"), exp)) %>%
arrange(term == "(Intercept)", p.value) %>%
rename(odds_ratio = estimate, z_value = statistic)
# Stylize and print
multivariable_pvalue <-
fit_summary %>%
rename(
"Predictors" = term,
"Odds Ratios (OR)" = odds_ratio,
"SE" = std.error,
"Z-scores" = z_value,
"P-values" = p.value,
"CI (low)" = conf.low,
"CI (high)" = conf.high,
) %>%
mutate(
`Odds Ratios (OR)` = round(`Odds Ratios (OR)`, 2),
`SE` = round(`SE`, 2),
`CI (low)` = round(`CI (low)`, 2),
`CI (high)` = round(`CI (high)`, 2),
`P-values` = round(`P-values`, 3),
)
# Print
multivariable_pvalue %>% sable()
| Predictors | Odds Ratios (OR) | SE | Z-scores | P-values | CI (low) | CI (high) |
|---|---|---|---|---|---|---|
| procedure_combinationsR | 0.07 | 1.13 | -2.387 | 0.017 | 0.01 | 0.62 |
| procedure_combinationsMultimodal | 0.17 | 1.02 | -1.762 | 0.078 | 0.02 | 1.22 |
| procedure_combinationsS | 0.19 | 1.07 | -1.551 | 0.121 | 0.02 | 1.55 |
| age_at_first_treatment_yrs | 1.08 | 0.05 | 1.521 | 0.128 | 0.98 | 1.19 |
| has_deep_venous_drainageTRUE | 1.62 | 0.39 | 1.233 | 0.217 | 0.75 | 3.50 |
| max_size_cm | 1.07 | 0.09 | 0.774 | 0.439 | 0.90 | 1.27 |
| is_diffuse_nidusTRUE | 1.53 | 0.57 | 0.755 | 0.450 | 0.51 | 4.66 |
| is_eloquent_locationTRUE | 1.29 | 0.41 | 0.622 | 0.534 | 0.58 | 2.89 |
| (Intercept) | 0.34 | 1.26 | -0.865 | 0.387 | 0.03 | 3.96 |
Plot the coefficients with their confidence intervals.
robust_ci %>%
# broom does not support exponentiation after coeftest so do it manually
broom::tidy(exponentiate = F, conf.int = T) %>%
mutate(across(c(estimate, "conf.low", "conf.high"), exp)) %>%
filter(term != "(Intercept)") %>%
arrange(desc(estimate)) %>%
mutate(term = factor(term, levels = term)) %>%
ggplot(aes(x = estimate, y = term, color = term)) +
geom_pointrange(aes(xmin = conf.low, xmax = conf.high)) +
geom_vline(xintercept = 1, linetype = 2, alpha = 0.5) +
labs(x = "Odds ratio", y = NULL) +
guides(color = "none") +
theme(
axis.text = element_text(size = 12),
axis.title = element_text(size = 13)
)
Predict
Predict.
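The chunk that defines `predicted_probs` is not shown; it is presumably the in-sample fitted probabilities from the model, e.g.:

```r
# Assumed definition: in-sample probabilities for the fitted (complete-case) rows
predicted_probs <- predict(fit, type = "response")
```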
Plot histogram of predictions.
tibble(
Predictions = predicted_probs,
Truth = fit$model[, CHOSEN_OUTCOME]
) %>%
pivot_longer(
cols = everything(),
names_to = "keys",
values_to = "values"
) %>%
mutate(keys = fct_inorder(keys)) %>%
ggplot(aes(x = values, color = keys, fill = keys)) +
geom_histogram(alpha = 0.7) +
geom_vline(xintercept = 0.5, linetype = 2, alpha = 0.5) +
facet_wrap(vars(keys), ncol = 1, scales = "fixed") +
theme(
axis.title.x = element_blank(),
axis.title.y = element_blank(),
axis.text = element_text(size = 12),
strip.text.x = element_text(size = 13),
legend.position = "none"
)
Evaluate
How many examples were used?
# All examples
num_all_examples <- nrow(fit$data)
num_all_examples_positive <- sum(fit$data[, CHOSEN_OUTCOME], na.rm = T)
num_complete_cases <- nrow(fit$model)
num_complete_cases_positive <- sum(fit$model[, CHOSEN_OUTCOME], na.rm = T)
# Print
sprintf(
"
All examples = %s (%s with outcome)
Fitted examples = %s (%s with outcome)",
num_all_examples,
num_all_examples_positive,
num_complete_cases,
num_complete_cases_positive
) %>% cat()
All examples = 232 (50 with outcome)
Fitted examples = 223 (48 with outcome)
Residual analysis.
Pseudo R-squared.
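The "fitting null model for pseudo-r2" message and the column names below match `pscl::pR2`, so the hidden chunk was presumably:

```r
# Pseudo R-squared measures:
# McFadden = 1 - llh / llhNull; r2ML and r2CU are the Cox-Snell and
# Nagelkerke (Cragg-Uhler) variants
pscl::pR2(fit)
```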
fitting null model for pseudo-r2
llh llhNull G2 McFadden r2ML r2CU
-108.0485 -116.1441 16.1912 0.0697 0.0700 0.1082
ROC-AUC.
# Compute
auroc <- cvAUC::ci.cvAUC(
predictions = predicted_probs,
labels = fit$model[, CHOSEN_OUTCOME]
)
# Print
sprintf(
"Cross-validated AUROC = %.3f (SE, %.3f; %s%% CI, %.3f-%.3f)",
auroc$cvAUC,
auroc$se,
auroc$confidence * 100,
auroc$ci[1],
auroc$ci[2]
)
[1] "Cross-validated AUROC = 0.687 (SE, 0.044; 95% CI, 0.601-0.773)"
PR-AUC.
# Construct object
precrec_obj_tr <- precrec::evalmod(
scores = predicted_probs,
labels = fit$model[, CHOSEN_OUTCOME],
calc_avg = F,
cb_alpha = 0.05
)
# Print
# NOTE: Can use precrec::auc_ci to get CIs, but need to have multiple datasets
precrec_obj_tr %>%
precrec::part() %>%
precrec::pauc() %>%
sable()
| modnames | dsids | curvetypes | paucs | spaucs |
|---|---|---|---|---|
| m1 | 1 | ROC | 0.687 | 0.687 |
| m1 | 1 | PRC | 0.395 | 0.395 |
ROC and PR curves.
Another version of the ROC curve.
# Calculate ROC curve
roc_curve <- pROC::roc(fit$model[, CHOSEN_OUTCOME] ~ predicted_probs)
# Plot ROC curve
plot(roc_curve)
Area under the curve: 0.687
Plot all performance measures.
# Construct object
precrec_obj_tr2 <- precrec::evalmod(
scores = predicted_probs,
labels = fit$model[, CHOSEN_OUTCOME],
mode = "basic"
)
# Plot
precrec_obj_tr2 %>% autoplot()
Confusion matrix.
Kappa statistic: 0 = no agreement; 0.1-0.2 = slight agreement; 0.2-0.4 = fair agreement; 0.4-0.6 = moderate agreement; 0.6-0.8 = substantial agreement; 0.8-1.0 = near-perfect agreement.
No Information Rate (NIR): the proportion of the largest observed class. For example, if 23 of 35 patients had the outcome, that is the largest class and the NIR is 23/35 = 0.657. The larger the gap between Accuracy and the NIR, the more confident we can be that the model is not simply predicting the largest class.
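Both statistics can be reproduced by hand from the confusion matrix printed in this section:

```r
# Reproduce Kappa and NIR from the confusion matrix below
# (rows = predictions, columns = truth; counts from this section)
tab <- matrix(c(3, 45, 2, 173), nrow = 2,
              dimnames = list(pred = c("Event", "No Event"),
                              truth = c("Event", "No Event")))
n <- sum(tab)
po <- sum(diag(tab)) / n                      # observed agreement (accuracy), ~0.789
pe <- sum(rowSums(tab) * colSums(tab)) / n^2  # chance agreement
kappa <- (po - pe) / (1 - pe)                 # ~0.076, as reported by caret
nir <- max(colSums(tab)) / n                  # ~0.785
# caret's "P-Value [Acc > NIR]" is a one-sided exact binomial test
binom.test(sum(diag(tab)), n, p = nir, alternative = "greater")$p.value
```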
# Scores
pred <- factor(ifelse(predicted_probs > 0.5, "Event", "No Event"))
truth <- factor(ifelse(fit$model[, CHOSEN_OUTCOME], "Event", "No Event"))
# Count values per category
xtab <- table(
scores = pred,
labels = truth
)
# Compute
confusion_matrix <-
caret::confusionMatrix(
xtab,
positive = "Event",
prevalence = NULL, # Provide prevalence as a proportion here if you want
mode = "sens_spec"
)
# Print
confusion_matrix
Confusion Matrix and Statistics
labels
scores Event No Event
Event 3 2
No Event 45 173
Accuracy : 0.789
95% CI : (0.73, 0.841)
No Information Rate : 0.785
P-Value [Acc > NIR] : 0.474
Kappa : 0.076
Mcnemar's Test P-Value : 8.99e-10
Sensitivity : 0.0625
Specificity : 0.9886
Pos Pred Value : 0.6000
Neg Pred Value : 0.7936
Prevalence : 0.2152
Detection Rate : 0.0135
Detection Prevalence : 0.0224
Balanced Accuracy : 0.5255
'Positive' Class : Event
Variable importance
Plot importance (based on magnitude of z-score).
# Calculate importance
varimp <-
fit %>%
caret::varImp() %>%
as_tibble(rownames = "rn") %>%
mutate(Predictor = factor(rn) %>% fct_reorder(Overall)) %>%
dplyr::select(Predictor, Importance = Overall)
# Plot variable importance
varimp %>%
ggplot(aes(Predictor, Importance, fill = Importance)) +
geom_bar(stat = "identity", alpha = 0.9, width = 0.65) +
coord_flip() +
guides(fill = "none") +
labs(y = "\nVariable Importance", x = NULL) +
scale_fill_viridis_c() +
theme(
panel.grid.major.y = element_blank(),
panel.grid.minor.y = element_blank(),
panel.border = element_blank(),
axis.text.x = element_text(margin = margin(t = 7), size = 12),
axis.text.y = element_text(margin = margin(r = -10), size = 12),
axis.title = element_text(size = 13)
)
Multivariable - Without components
Feature selection
Create predictors.
# Define p-value threshold
pval_threshold <- 0.2
# Get the predictors of interest
predictors <-
univariable_unadjusted_recoded %>%
bind_rows(univariable_adjusted) %>%
filter(`P-values` < pval_threshold) %>%
pull(`Predictors`)
# Clean categorical variables
predictors <-
predictors %>%
str_replace("^([a-z0-9_]*)([A-Z].*)$", "\\1") %>%
unique()
# Add adjustment variables (none at present) and deduplicate
predictors <- unique(c(predictors))
# Remove unwanted predictors
predictors <- predictors[!predictors %in% unwanted_with_gradings]
# Print
print(predictors)
[1] "location" "spetzler_martin_grade"
[3] "procedure_combinations" "age_at_first_treatment_yrs"
[5] "is_diffuse_nidus"
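The `str_replace` call above strips the factor-level suffix that `glm`/`broom` append to categorical terms, keeping only the snake_case variable name; for example:

```r
# Term names as emitted by broom/glm; the level suffix starts at the first capital
terms <- c("procedure_combinationsMultimodal", "is_diffuse_nidusTRUE",
           "age_at_first_treatment_yrs")
stringr::str_replace(terms, "^([a-z0-9_]*)([A-Z].*)$", "\\1")
# "procedure_combinations" "is_diffuse_nidus" "age_at_first_treatment_yrs"
```

Continuous terms with no capital letter pass through unchanged.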
Model selection
Fit logistic.
# Define data
df <- df_multi
# Create formula
model <- as.formula(paste(
CHOSEN_OUTCOME,
"~",
paste(predictors, collapse = " + ")
))
fit <- glm(
model,
data = df,
family = binomial()
)
fit_with_grading <- fit
Print results.
# Summarize model coefficients
fit_summary <-
fit %>%
broom::tidy(exponentiate = T, conf.int = T) %>%
arrange(term == "(Intercept)", p.value) %>%
rename(odds_ratio = estimate, z_value = statistic)
# Stylize and print
stylized <-
fit_summary %>%
rename(
"Predictors" = term,
"Odds Ratios (OR)" = odds_ratio,
"SE" = std.error,
"Z-scores" = z_value,
"P-values" = p.value,
"CI (low)" = conf.low,
"CI (high)" = conf.high,
) %>%
mutate(
`Odds Ratios (OR)` = round(`Odds Ratios (OR)`, 2),
`SE` = round(`SE`, 2),
`CI (low)` = round(`CI (low)`, 2),
`CI (high)` = round(`CI (high)`, 2),
`P-values` = round(`P-values`, 3),
)
# Print
stylized %>% sable()
| Predictors | Odds Ratios (OR) | SE | Z-scores | P-values | CI (low) | CI (high) |
|---|---|---|---|---|---|---|
| locationHemispheric | 0.20 | 0.48 | -3.386 | 0.001 | 0.07 | 0.50 |
| procedure_combinationsR | 0.08 | 1.07 | -2.363 | 0.018 | 0.01 | 0.63 |
| spetzler_martin_grade | 1.56 | 0.20 | 2.211 | 0.027 | 1.06 | 2.33 |
| locationDeep | 0.37 | 0.55 | -1.833 | 0.067 | 0.12 | 1.07 |
| age_at_first_treatment_yrs | 1.09 | 0.05 | 1.755 | 0.079 | 0.99 | 1.21 |
| procedure_combinationsMultimodal | 0.24 | 0.98 | -1.445 | 0.149 | 0.03 | 1.66 |
| procedure_combinationsS | 0.32 | 1.03 | -1.103 | 0.270 | 0.04 | 2.43 |
| is_diffuse_nidusTRUE | 1.33 | 0.53 | 0.542 | 0.588 | 0.47 | 3.72 |
| (Intercept) | 0.34 | 1.20 | -0.901 | 0.368 | 0.03 | 4.04 |
Inference
Create robust CIs using the sandwich estimator.
# Calculate robust standard errors
robust_se <- sandwich::vcovHC(fit, type = "HC0")
# Compute robust confidence intervals
robust_ci <- lmtest::coeftest(fit, vcov. = robust_se)
Stylize results.
# Summarize model coefficients
fit_summary <-
robust_ci %>%
# broom does not support exponentiation after coeftest so do it manually
broom::tidy(exponentiate = F, conf.int = T) %>%
mutate(across(c(estimate, "conf.low", "conf.high"), exp)) %>%
arrange(term == "(Intercept)", p.value) %>%
rename(odds_ratio = estimate, z_value = statistic)
# Stylize and print
multivariable_pvalue <-
fit_summary %>%
rename(
"Predictors" = term,
"Odds Ratios (OR)" = odds_ratio,
"SE" = std.error,
"Z-scores" = z_value,
"P-values" = p.value,
"CI (low)" = conf.low,
"CI (high)" = conf.high,
) %>%
mutate(
`Odds Ratios (OR)` = round(`Odds Ratios (OR)`, 2),
`SE` = round(`SE`, 2),
`CI (low)` = round(`CI (low)`, 2),
`CI (high)` = round(`CI (high)`, 2),
`P-values` = round(`P-values`, 3),
)
# Print
multivariable_pvalue %>% sable()
| Predictors | Odds Ratios (OR) | SE | Z-scores | P-values | CI (low) | CI (high) |
|---|---|---|---|---|---|---|
| locationHemispheric | 0.20 | 0.45 | -3.618 | 0.000 | 0.08 | 0.47 |
| procedure_combinationsR | 0.08 | 1.17 | -2.168 | 0.030 | 0.01 | 0.78 |
| spetzler_martin_grade | 1.56 | 0.21 | 2.092 | 0.036 | 1.03 | 2.35 |
| locationDeep | 0.37 | 0.55 | -1.838 | 0.066 | 0.13 | 1.07 |
| age_at_first_treatment_yrs | 1.09 | 0.05 | 1.810 | 0.070 | 0.99 | 1.20 |
| procedure_combinationsMultimodal | 0.24 | 1.07 | -1.316 | 0.188 | 0.03 | 1.99 |
| procedure_combinationsS | 0.32 | 1.12 | -1.012 | 0.311 | 0.04 | 2.90 |
| is_diffuse_nidusTRUE | 1.33 | 0.52 | 0.549 | 0.583 | 0.48 | 3.68 |
| (Intercept) | 0.34 | 1.35 | -0.798 | 0.425 | 0.02 | 4.82 |
Plot the coefficients with their confidence intervals.
robust_ci %>%
# broom does not support exponentiation after coeftest so do it manually
broom::tidy(exponentiate = F, conf.int = T) %>%
mutate(across(c(estimate, "conf.low", "conf.high"), exp)) %>%
filter(term != "(Intercept)") %>%
arrange(desc(estimate)) %>%
mutate(term = factor(term, levels = term)) %>%
ggplot(aes(x = estimate, y = term, color = term)) +
geom_pointrange(aes(xmin = conf.low, xmax = conf.high)) +
geom_vline(xintercept = 1, linetype = 2, alpha = 0.5) +
labs(x = "Odds ratio", y = NULL) +
guides(color = "none") +
theme(
axis.text = element_text(size = 12),
axis.title = element_text(size = 13)
)
Predict
Predict.
Plot histogram of predictions.
tibble(
Predictions = predicted_probs,
Truth = fit$model[, CHOSEN_OUTCOME]
) %>%
pivot_longer(
cols = everything(),
names_to = "keys",
values_to = "values"
) %>%
mutate(keys = fct_inorder(keys)) %>%
ggplot(aes(x = values, color = keys, fill = keys)) +
geom_histogram(alpha = 0.7) +
geom_vline(xintercept = 0.5, linetype = 2, alpha = 0.5) +
facet_wrap(vars(keys), ncol = 1, scales = "fixed") +
theme(
axis.title.x = element_blank(),
axis.title.y = element_blank(),
axis.text = element_text(size = 12),
strip.text.x = element_text(size = 13),
legend.position = "none"
)
Evaluate
How many examples were used?
# All examples
num_all_examples <- nrow(fit$data)
num_all_examples_positive <- sum(fit$data[, CHOSEN_OUTCOME], na.rm = T)
num_complete_cases <- nrow(fit$model)
num_complete_cases_positive <- sum(fit$model[, CHOSEN_OUTCOME], na.rm = T)
# Print
sprintf(
"
All examples = %s (%s with outcome)
Fitted examples = %s (%s with outcome)",
num_all_examples,
num_all_examples_positive,
num_complete_cases,
num_complete_cases_positive
) %>% cat()
All examples = 236 (51 with outcome)
Fitted examples = 223 (48 with outcome)
Residual analysis.
Pseudo R-squared.
fitting null model for pseudo-r2
llh llhNull G2 McFadden r2ML r2CU
-100.920 -116.144 30.449 0.131 0.128 0.197
ROC-AUC.
# Compute
auroc <- cvAUC::ci.cvAUC(
predictions = predicted_probs,
labels = fit$model[, CHOSEN_OUTCOME]
)
# Print
sprintf(
"Cross-validated AUROC = %.3f (SE, %.3f; %s%% CI, %.3f-%.3f)",
auroc$cvAUC,
auroc$se,
auroc$confidence * 100,
auroc$ci[1],
auroc$ci[2]
)
[1] "Cross-validated AUROC = 0.731 (SE, 0.042; 95% CI, 0.648-0.814)"
PR-AUC.
# Construct object
precrec_obj_tr <- precrec::evalmod(
scores = predicted_probs,
labels = fit$model[, CHOSEN_OUTCOME],
calc_avg = F,
cb_alpha = 0.05
)
# Print
# NOTE: Can use precrec::auc_ci to get CIs, but need to have multiple datasets
precrec_obj_tr %>%
precrec::part() %>%
precrec::pauc() %>%
sable()
| modnames | dsids | curvetypes | paucs | spaucs |
|---|---|---|---|---|
| m1 | 1 | ROC | 0.731 | 0.731 |
| m1 | 1 | PRC | 0.490 | 0.490 |
ROC and PR curves.
Another version of the ROC curve.
# Calculate ROC curve
roc_curve <- pROC::roc(fit$model[, CHOSEN_OUTCOME] ~ predicted_probs)
# Plot ROC curve
plot(roc_curve)
Area under the curve: 0.731
Plot all performance measures.
# Construct object
precrec_obj_tr2 <- precrec::evalmod(
scores = predicted_probs,
labels = fit$model[, CHOSEN_OUTCOME],
mode = "basic"
)
# Plot
precrec_obj_tr2 %>% autoplot()
Confusion matrix.
Kappa statistic: 0 = no agreement; 0.1-0.2 = slight agreement; 0.2-0.4 = fair agreement; 0.4-0.6 = moderate agreement; 0.6-0.8 = substantial agreement; 0.8-1.0 = near-perfect agreement.
No Information Rate (NIR): the proportion of the largest observed class. For example, if 23 of 35 patients had the outcome, that is the largest class and the NIR is 23/35 = 0.657. The larger the gap between Accuracy and the NIR, the more confident we can be that the model is not simply predicting the largest class.
# Scores
pred <- factor(ifelse(predicted_probs > 0.5, "Event", "No Event"))
truth <- factor(ifelse(fit$model[, CHOSEN_OUTCOME], "Event", "No Event"))
# Count values per category
xtab <- table(
scores = pred,
labels = truth
)
# Compute
confusion_matrix <-
caret::confusionMatrix(
xtab,
positive = "Event",
prevalence = NULL, # Provide prevalence as a proportion here if you want
mode = "sens_spec"
)
# Print
confusion_matrix
Confusion Matrix and Statistics
labels
scores Event No Event
Event 11 5
No Event 37 170
Accuracy : 0.812
95% CI : (0.754, 0.861)
No Information Rate : 0.785
P-Value [Acc > NIR] : 0.186
Kappa : 0.265
Mcnemar's Test P-Value : 1.72e-06
Sensitivity : 0.2292
Specificity : 0.9714
Pos Pred Value : 0.6875
Neg Pred Value : 0.8213
Prevalence : 0.2152
Detection Rate : 0.0493
Detection Prevalence : 0.0717
Balanced Accuracy : 0.6003
'Positive' Class : Event
Variable importance
Plot importance (based on magnitude of z-score).
# Calculate importance
varimp <-
fit %>%
caret::varImp() %>%
as_tibble(rownames = "rn") %>%
mutate(Predictor = factor(rn) %>% fct_reorder(Overall)) %>%
dplyr::select(Predictor, Importance = Overall)
# Plot variable importance
varimp %>%
ggplot(aes(Predictor, Importance, fill = Importance)) +
geom_bar(stat = "identity", alpha = 0.9, width = 0.65) +
coord_flip() +
guides(fill = "none") +
labs(y = "\nVariable Importance", x = NULL) +
scale_fill_viridis_c() +
theme(
panel.grid.major.y = element_blank(),
panel.grid.minor.y = element_blank(),
panel.border = element_blank(),
axis.text.x = element_text(margin = margin(t = 7), size = 12),
axis.text.y = element_text(margin = margin(r = -10), size = 12),
axis.title = element_text(size = 13)
)
Multivariable - With interactions
Presents with hemorrhage
Fit logistic.
# Define data
df <- df_multi
# Create formula
model <- as.formula(paste(
CHOSEN_OUTCOME,
"~",
"modified_rankin_score_pretreatment*has_hemorrhage +
spetzler_martin_grade*has_hemorrhage +
age_at_first_treatment_yrs*has_hemorrhage"
))
fit <- glm(
model,
data = df,
family = binomial()
)
Print results.
# Summarize model coefficients
fit_summary <-
fit %>%
broom::tidy(exponentiate = T, conf.int = T) %>%
arrange(term == "(Intercept)", p.value) %>%
rename(odds_ratio = estimate, z_value = statistic)
# Stylize and print
stylized <-
fit_summary %>%
rename(
"Predictors" = term,
"Odds Ratios (OR)" = odds_ratio,
"SE" = std.error,
"Z-scores" = z_value,
"P-values" = p.value,
"CI (low)" = conf.low,
"CI (high)" = conf.high,
) %>%
mutate(
`Odds Ratios (OR)` = round(`Odds Ratios (OR)`, 2),
`SE` = round(`SE`, 2),
`CI (low)` = round(`CI (low)`, 2),
`CI (high)` = round(`CI (high)`, 2),
`P-values` = round(`P-values`, 3),
)
# Print
stylized %>% sable()
| Predictors | Odds Ratios (OR) | SE | Z-scores | P-values | CI (low) | CI (high) |
|---|---|---|---|---|---|---|
| age_at_first_treatment_yrs | 1.25 | 0.09 | 2.58 | 0.010 | 1.07 | 1.50 |
| has_hemorrhageTRUE | 127.73 | 2.01 | 2.41 | 0.016 | 3.23 | 9449.99 |
| spetzler_martin_grade | 1.90 | 0.27 | 2.39 | 0.017 | 1.16 | 3.35 |
| modified_rankin_score_pretreatment:has_hemorrhageTRUE | 0.45 | 0.36 | -2.18 | 0.029 | 0.22 | 0.90 |
| modified_rankin_score_pretreatment | 2.04 | 0.33 | 2.16 | 0.031 | 1.09 | 4.09 |
| has_hemorrhageTRUE:age_at_first_treatment_yrs | 0.82 | 0.11 | -1.90 | 0.057 | 0.65 | 1.00 |
| has_hemorrhageTRUE:spetzler_martin_grade | 0.65 | 0.33 | -1.33 | 0.184 | 0.33 | 1.21 |
| (Intercept) | 0.00 | 1.79 | -3.82 | 0.000 | 0.00 | 0.02 |
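When reading the table, remember that with interactions the stratum-specific OR is the product of the main-effect and interaction ORs; using the rounded estimates above:

```r
# OR per point of pre-treatment mRS among patients WITH hemorrhage
2.04 * 0.45  # ~0.92 (vs. 2.04 without hemorrhage)
# OR per year of age at first treatment among patients WITH hemorrhage
1.25 * 0.82  # ~1.03 (vs. 1.25 without hemorrhage)
```

Both associations are attenuated toward the null in the hemorrhage stratum.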
Model comparison
Special cases
SM < 4
Select features.
# Define p-value threshold
pval_threshold <- 0.2
# Get the predictors of interest
predictors <-
univariable_adjusted_recoded %>%
bind_rows(univariable_unadjusted_recoded) %>%
filter(`P-values` < pval_threshold) %>%
pull(`Predictors`)
# Clean categorical variables
predictors <-
predictors %>%
str_replace("^([a-z0-9_]*)([A-Z].*)$", "\\1") %>%
unique()
# Add adjustment variables (none at present) and deduplicate
predictors <- unique(c(predictors))
# Remove unwanted predictors
predictors <- predictors[!predictors %in% unwanted_with_gradings]
# Replace SM by its binary variant
predictors <- c(predictors, "is_spetzler_martin_grade_less_than_4")
predictors <- predictors[!predictors == "spetzler_martin_grade"]
# Print
print(predictors)
[1] "location"
[2] "procedure_combinations"
[3] "is_diffuse_nidus"
[4] "age_at_first_treatment_yrs"
[5] "is_spetzler_martin_grade_less_than_4"
Fit logistic.
# Define data
df <- df_multi %>% drop_na(modified_rankin_score_pretreatment, location)
# Create formula
model <- as.formula(paste(
CHOSEN_OUTCOME,
"~",
paste(predictors, collapse = " + ")
))
fit <- glm(
model,
data = df,
family = binomial()
)
# Save
fit_without_grading <- fit
Print results.
# Summarize model coefficients
fit_summary <-
fit %>%
broom::tidy(exponentiate = T, conf.int = T) %>%
arrange(term == "(Intercept)", p.value) %>%
rename(odds_ratio = estimate, z_value = statistic)
# Stylize and print
stylized <-
fit_summary %>%
rename(
"Predictors" = term,
"Odds Ratios (OR)" = odds_ratio,
"SE" = std.error,
"Z-scores" = z_value,
"P-values" = p.value,
"CI (low)" = conf.low,
"CI (high)" = conf.high,
) %>%
mutate(
`Odds Ratios (OR)` = round(`Odds Ratios (OR)`, 2),
`SE` = round(`SE`, 2),
`CI (low)` = round(`CI (low)`, 2),
`CI (high)` = round(`CI (high)`, 2),
`P-values` = round(`P-values`, 3),
)
# Print
stylized %>% sable()
| Predictors | Odds Ratios (OR) | SE | Z-scores | P-values | CI (low) | CI (high) |
|---|---|---|---|---|---|---|
| locationHemispheric | 0.20 | 0.48 | -3.369 | 0.001 | 0.08 | 0.51 |
| is_spetzler_martin_grade_less_than_4TRUE | 0.34 | 0.43 | -2.454 | 0.014 | 0.15 | 0.80 |
| procedure_combinationsR | 0.08 | 1.08 | -2.392 | 0.017 | 0.01 | 0.61 |
| age_at_first_treatment_yrs | 1.11 | 0.05 | 2.050 | 0.040 | 1.01 | 1.23 |
| locationDeep | 0.42 | 0.54 | -1.621 | 0.105 | 0.14 | 1.20 |
| procedure_combinationsMultimodal | 0.23 | 0.99 | -1.494 | 0.135 | 0.03 | 1.58 |
| procedure_combinationsS | 0.28 | 1.03 | -1.241 | 0.215 | 0.03 | 2.09 |
| is_diffuse_nidusTRUE | 1.41 | 0.51 | 0.686 | 0.493 | 0.51 | 3.79 |
| (Intercept) | 2.20 | 1.18 | 0.667 | 0.505 | 0.22 | 26.17 |
Create robust CIs using the sandwich estimator.
# Calculate robust standard errors
robust_se <- sandwich::vcovHC(fit, type = "HC0")
# Compute robust confidence intervals
robust_ci <- lmtest::coeftest(fit, vcov. = robust_se)
Stylize results.
# Summarize model coefficients
fit_summary <-
robust_ci %>%
# broom does not support exponentiation after coeftest so do it manually
broom::tidy(exponentiate = F, conf.int = T) %>%
mutate(across(c(estimate, "conf.low", "conf.high"), exp)) %>%
arrange(term == "(Intercept)", p.value) %>%
rename(odds_ratio = estimate, z_value = statistic)
# Stylize and print
multivariable_pvalue <-
fit_summary %>%
rename(
"Predictors" = term,
"Odds Ratios (OR)" = odds_ratio,
"SE" = std.error,
"Z-scores" = z_value,
"P-values" = p.value,
"CI (low)" = conf.low,
"CI (high)" = conf.high,
) %>%
mutate(
`Odds Ratios (OR)` = round(`Odds Ratios (OR)`, 2),
`SE` = round(`SE`, 2),
`CI (low)` = round(`CI (low)`, 2),
`CI (high)` = round(`CI (high)`, 2),
`P-values` = round(`P-values`, 3),
)
# Print
multivariable_pvalue %>% sable()
| Predictors | Odds Ratios (OR) | SE | Z-scores | P-values | CI (low) | CI (high) |
|---|---|---|---|---|---|---|
| locationHemispheric | 0.20 | 0.46 | -3.524 | 0.000 | 0.08 | 0.49 |
| is_spetzler_martin_grade_less_than_4TRUE | 0.34 | 0.45 | -2.350 | 0.019 | 0.14 | 0.84 |
| procedure_combinationsR | 0.08 | 1.18 | -2.189 | 0.029 | 0.01 | 0.76 |
| age_at_first_treatment_yrs | 1.11 | 0.05 | 2.136 | 0.033 | 1.01 | 1.22 |
| locationDeep | 0.42 | 0.55 | -1.573 | 0.116 | 0.14 | 1.24 |
| procedure_combinationsMultimodal | 0.23 | 1.08 | -1.363 | 0.173 | 0.03 | 1.91 |
| procedure_combinationsS | 0.28 | 1.13 | -1.140 | 0.254 | 0.03 | 2.52 |
| is_diffuse_nidusTRUE | 1.41 | 0.51 | 0.682 | 0.495 | 0.52 | 3.83 |
| (Intercept) | 2.20 | 1.24 | 0.635 | 0.526 | 0.19 | 25.00 |
Plot the coefficients with their confidence intervals.
robust_ci %>%
# broom does not support exponentiation after coeftest so do it manually
broom::tidy(exponentiate = F, conf.int = T) %>%
mutate(across(c(estimate, "conf.low", "conf.high"), exp)) %>%
filter(term != "(Intercept)") %>%
arrange(desc(estimate)) %>%
mutate(term = factor(term, levels = term)) %>%
ggplot(aes(x = estimate, y = term, color = term)) +
geom_pointrange(aes(xmin = conf.low, xmax = conf.high)) +
geom_vline(xintercept = 1, linetype = 2, alpha = 0.5) +
labs(x = "Odds ratio", y = NULL) +
guides(color = "none") +
theme(
axis.text = element_text(size = 12),
axis.title = element_text(size = 13)
)
Had surgery
Select features.
# Define p-value threshold
pval_threshold <- 0.2
# Get the predictors of interest
predictors <-
univariable_adjusted_recoded %>%
bind_rows(univariable_unadjusted_recoded) %>%
filter(`P-values` < pval_threshold) %>%
pull(`Predictors`)
# Clean categorical variables
predictors <-
predictors %>%
str_replace("^([a-z0-9_]*)([A-Z].*)$", "\\1") %>%
unique()
# Add adjustment variables (none at present) and deduplicate
predictors <- unique(c(predictors))
# Remove unwanted predictors
predictors <- predictors[!predictors %in% unwanted_with_gradings]
# Replace procedure combinations by the binary surgery indicator
predictors <- c(predictors, "is_surgery")
predictors <- predictors[!predictors == "procedure_combinations"]
# Print
print(predictors)
[1] "location" "spetzler_martin_grade"
[3] "is_diffuse_nidus" "age_at_first_treatment_yrs"
[5] "is_surgery"
Fit logistic.
# Define data
df <- df_multi %>% drop_na(modified_rankin_score_pretreatment, location)
# Create formula
model <- as.formula(paste(
CHOSEN_OUTCOME,
"~",
paste(predictors, collapse = " + ")
))
fit <- glm(
model,
data = df,
family = binomial()
)
# Save
fit_without_grading <- fit
Print results.
# Summarize model coefficients
fit_summary <-
fit %>%
broom::tidy(exponentiate = T, conf.int = T) %>%
arrange(term == "(Intercept)", p.value) %>%
rename(odds_ratio = estimate, z_value = statistic)
# Stylize and print
stylized <-
fit_summary %>%
rename(
"Predictors" = term,
"Odds Ratios (OR)" = odds_ratio,
"SE" = std.error,
"Z-scores" = z_value,
"P-values" = p.value,
"CI (low)" = conf.low,
"CI (high)" = conf.high,
) %>%
mutate(
`Odds Ratios (OR)` = round(`Odds Ratios (OR)`, 2),
`SE` = round(`SE`, 2),
`CI (low)` = round(`CI (low)`, 2),
`CI (high)` = round(`CI (high)`, 2),
`P-values` = round(`P-values`, 3),
)
# Print
stylized %>% sable()
| Predictors | Odds Ratios (OR) | SE | Z-scores | P-values | CI (low) | CI (high) |
|---|---|---|---|---|---|---|
| locationHemispheric | 0.20 | 0.47 | -3.433 | 0.001 | 0.08 | 0.50 |
| spetzler_martin_grade | 1.55 | 0.19 | 2.232 | 0.026 | 1.06 | 2.29 |
| locationDeep | 0.32 | 0.53 | -2.182 | 0.029 | 0.11 | 0.88 |
| age_at_first_treatment_yrs | 1.10 | 0.05 | 1.891 | 0.059 | 1.00 | 1.21 |
| is_surgeryTRUE | 1.46 | 0.50 | 0.759 | 0.448 | 0.54 | 3.83 |
| is_diffuse_nidusTRUE | 1.24 | 0.51 | 0.419 | 0.675 | 0.44 | 3.36 |
| (Intercept) | 0.07 | 0.85 | -3.098 | 0.002 | 0.01 | 0.36 |
Create robust CIs using the sandwich estimator.
# Calculate robust standard errors
robust_se <- sandwich::vcovHC(fit, type = "HC0")
# Compute robust confidence intervals
robust_ci <- lmtest::coeftest(fit, vcov. = robust_se)
Stylize results.
# Summarize model coefficients
fit_summary <-
robust_ci %>%
# broom does not support exponentiation after coeftest so do it manually
broom::tidy(exponentiate = F, conf.int = T) %>%
mutate(across(c(estimate, "conf.low", "conf.high"), exp)) %>%
arrange(term == "(Intercept)", p.value) %>%
rename(odds_ratio = estimate, z_value = statistic)
# Stylize and print
multivariable_pvalue <-
fit_summary %>%
rename(
"Predictors" = term,
"Odds Ratios (OR)" = odds_ratio,
"SE" = std.error,
"Z-scores" = z_value,
"P-values" = p.value,
"CI (low)" = conf.low,
"CI (high)" = conf.high,
) %>%
mutate(
`Odds Ratios (OR)` = round(`Odds Ratios (OR)`, 2),
`SE` = round(`SE`, 2),
`CI (low)` = round(`CI (low)`, 2),
`CI (high)` = round(`CI (high)`, 2),
`P-values` = round(`P-values`, 3),
)
# Print
multivariable_pvalue %>% sable()
| Predictors | Odds Ratios (OR) | SE | Z-scores | P-values | CI (low) | CI (high) |
|---|---|---|---|---|---|---|
| locationHemispheric | 0.20 | 0.43 | -3.698 | 0.000 | 0.09 | 0.47 |
| locationDeep | 0.32 | 0.51 | -2.244 | 0.025 | 0.12 | 0.86 |
| spetzler_martin_grade | 1.55 | 0.21 | 2.075 | 0.038 | 1.02 | 2.33 |
| age_at_first_treatment_yrs | 1.10 | 0.05 | 1.908 | 0.056 | 1.00 | 1.21 |
| is_surgeryTRUE | 1.46 | 0.51 | 0.740 | 0.460 | 0.54 | 3.97 |
| is_diffuse_nidusTRUE | 1.24 | 0.53 | 0.409 | 0.682 | 0.44 | 3.47 |
| (Intercept) | 0.07 | 1.00 | -2.652 | 0.008 | 0.01 | 0.50 |
Plot the coefficients with their confidence intervals.
robust_ci %>%
# broom does not support exponentiation after coeftest so do it manually
broom::tidy(exponentiate = F, conf.int = T) %>%
mutate(across(c(estimate, "conf.low", "conf.high"), exp)) %>%
filter(term != "(Intercept)") %>%
arrange(desc(estimate)) %>%
mutate(term = factor(term, levels = term)) %>%
ggplot(aes(x = estimate, y = term, color = term)) +
geom_pointrange(aes(xmin = conf.low, xmax = conf.high)) +
geom_vline(xintercept = 1, linetype = 2, alpha = 0.5) +
labs(x = "Odds ratio", y = NULL) +
guides(color = "none") +
theme(
axis.text = element_text(size = 12),
axis.title = element_text(size = 13)
)
Grade 5s
df_uni %>%
drop_na(spetzler_martin_grade) %>%
count(is_poor_outcome, spetzler_martin_grade, name = "num_patients") %>%
mutate(
pct_patients = scales::percent(num_patients / sum(num_patients), 1)
) %>%
sable()
| is_poor_outcome | spetzler_martin_grade | num_patients | pct_patients |
|---|---|---|---|
| FALSE | 1 | 38 | 17% |
| FALSE | 2 | 45 | 20% |
| FALSE | 3 | 65 | 29% |
| FALSE | 4 | 37 | 16% |
| FALSE | 5 | 17 | 7% |
| TRUE | 1 | 2 | 1% |
| TRUE | 2 | 3 | 1% |
| TRUE | 3 | 6 | 3% |
| TRUE | 4 | 7 | 3% |
| TRUE | 5 | 8 | 4% |
Only consider those with SM grade 5.
df_uni %>%
filter(spetzler_martin_grade == 5) %>%
count(is_poor_outcome, spetzler_martin_grade, name = "num_patients") %>%
mutate(
pct_patients = scales::percent(num_patients / sum(num_patients), 1)
) %>%
sable()
| is_poor_outcome | spetzler_martin_grade | num_patients | pct_patients |
|---|---|---|---|
| FALSE | 5 | 17 | 68% |
| TRUE | 5 | 8 | 32% |
Honest estimation (WIP)
https://www.pnas.org/doi/full/10.1073/pnas.1510489113
https://arxiv.org/pdf/1510.04342
https://github.com/grf-labs/grf
https://grf-labs.github.io/grf/REFERENCE.html
https://cran.r-project.org/web/packages/hettx/vignettes/detect_idiosyncratic_vignette.html
estimate std.err
0.3586 0.0488
estimate std.err
0.4041 0.0506
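The estimate/std.err pairs above look like average-treatment-effect output from grf. A minimal honest causal-forest sketch along the lines of the linked references (the `X`, `Y`, `W` objects are hypothetical placeholders for a covariate matrix, outcome, and treatment indicator):

```r
library(grf)
# Honest causal forest: each tree splits on one half-sample and
# estimates leaf effects on the other (honesty = TRUE is the default)
cf <- causal_forest(
  X = X,  # covariate matrix (hypothetical)
  Y = Y,  # outcome vector
  W = W,  # binary treatment indicator
  honesty = TRUE,
  seed = 1891
)
average_treatment_effect(cf, target.sample = "all")  # returns estimate, std.err
```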
Write
Setup
Create the necessary directories.
# Get today's date
today <- Sys.Date()
today <- format(today, "%Y-%m-%d")
# Create folder names
base_folder <- file.path(DST_DIRNAME, ANALYSIS_NAME)
date_folder <- file.path(base_folder, today)
csv_folder <- file.path(date_folder, "csv")
pdf_folder <- file.path(date_folder, "pdf")
html_folder <- file.path(date_folder, "html")
# Create folders
suppressWarnings(dir.create(base_folder))
suppressWarnings(dir.create(date_folder))
suppressWarnings(dir.create(csv_folder))
suppressWarnings(dir.create(pdf_folder))
suppressWarnings(dir.create(html_folder))
# Print folder names
print(base_folder)
print(date_folder)
print(csv_folder)
print(pdf_folder)
print(html_folder)
[1] "../../outputs//predictive-analytics/complications-minor"
[1] "../../outputs//predictive-analytics/complications-minor/2024-07-29"
[1] "../../outputs//predictive-analytics/complications-minor/2024-07-29/csv"
[1] "../../outputs//predictive-analytics/complications-minor/2024-07-29/pdf"
[1] "../../outputs//predictive-analytics/complications-minor/2024-07-29/html"
Figures
Write all figures.
Tables
Write all tables.
# # Arguments
# df <- tables$desc_stats_cohorts_cv_prolaio
# filepath_csv <- file.path(csv_folder, "desc-stats_by-cohort_cov.csv")
# filepath_html <- file.path(html_folder, "desc-stats_by-cohort_cov.html")
#
# # Save as CSV
# write_csv(
# x = df,
# file = filepath_csv
# )
#
# # Save as HTML
# # - Save pdf does not work with webshot
# # - It works with pagedown but not as pretty as desired
# df %>%
# sable() %>%
# kableExtra::save_kable(file = filepath_html)
Reproducibility
Linting and styling
Dependency management
Documentation
Session Info
R version 4.3.2 (2023-10-31)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS Sonoma 14.5
Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib;
LAPACK version 3.11.0
attached base packages:
[1] parallel stats graphics grDevices utils datasets methods
[8] base
other attached packages:
[1] grf_2.3.2 lubridate_1.9.3 forcats_1.0.0
[4] stringr_1.5.1 dplyr_1.1.4 purrr_1.0.2
[7] readr_2.1.5 tidyr_1.3.1 tibble_3.2.1
[10] tidyverse_2.0.0 selectiveInference_1.2.5 MASS_7.3-60
[13] adaptMCMC_1.5 coda_0.19-4.1 survival_3.5-7
[16] intervals_0.15.4 glmnet_4.1-8 Matrix_1.6-1.1
[19] magrittr_2.0.3 ggplot2_3.5.0 kableExtra_1.4.0
[22] rmdformats_1.0.4 knitr_1.45
loaded via a namespace (and not attached):
[1] pROC_1.18.5 tcltk_4.3.2 sandwich_3.1-0
[4] rlang_1.1.3 e1071_1.7-14 matrixStats_1.3.0
[7] compiler_4.3.2 systemfonts_1.0.5 vctrs_0.6.5
[10] reshape2_1.4.4 cvAUC_1.1.4 pkgconfig_2.0.3
[13] shape_1.4.6.1 crayon_1.5.2 summarytools_1.0.1
[16] fastmap_1.1.1 backports_1.4.1 magick_2.8.3
[19] labeling_0.4.3 pander_0.6.5 utf8_1.2.4
[22] rmarkdown_2.25 prodlim_2023.08.28 tzdb_0.4.0
[25] bit_4.0.5 xfun_0.42 cachem_1.0.8
[28] jsonlite_1.8.8 recipes_1.0.10 highr_0.10
[31] pryr_0.1.6 broom_1.0.6 R6_2.5.1
[34] bslib_0.6.1 stringi_1.8.3 parallelly_1.37.1
[37] rpart_4.1.21 lmtest_0.9-40 jquerylib_0.1.4
[40] Rcpp_1.0.12 bookdown_0.39 assertthat_0.2.1
[43] iterators_1.0.14 future.apply_1.11.2 zoo_1.8-12
[46] pscl_1.5.9 base64enc_0.1-3 nnet_7.3-19
[49] splines_4.3.2 timechange_0.3.0 tidyselect_1.2.1
[52] rstudioapi_0.15.0 yaml_2.3.8 timeDate_4032.109
[55] codetools_0.2-19 listenv_0.9.1 lattice_0.21-9
[58] precrec_0.14.4 plyr_1.8.9 withr_3.0.0
[61] ROCR_1.0-11 evaluate_0.23 future_1.33.2
[64] proxy_0.4-27 xml2_1.3.6 BiocManager_1.30.22
[67] pillar_1.9.0 renv_1.0.4 stats4_4.3.2
[70] checkmate_2.3.1 foreach_1.5.2 generics_0.1.3
[73] vroom_1.6.5 hms_1.1.3 munsell_0.5.0
[76] scales_1.3.0 globals_0.16.3 class_7.3-22
[79] glue_1.7.0 tools_4.3.2 data.table_1.15.4
[82] ModelMetrics_1.2.2.2 gower_1.0.1 rapportools_1.1
[85] grid_4.3.2 ipred_0.9-14 colorspace_2.1-0
[88] nlme_3.1-163 patchwork_1.2.0 cli_3.6.2
[91] fansi_1.0.6 viridisLite_0.4.2 lava_1.8.0
[94] svglite_2.1.3 gtable_0.3.4 ggcorrplot_0.1.4.1
[97] sass_0.4.8 digest_0.6.34 caret_6.0-94
[100] farver_2.1.1 htmltools_0.5.7 lifecycle_1.0.4
[103] hardhat_1.4.0 bit64_4.0.5
References
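The package citations below could be regenerated with something like the following (an assumption about how the list was built, not the author's confirmed method):

```r
# Collect citation entries for R itself and every attached package
pkgs <- names(sessionInfo()$otherPkgs)
c(list(citation()), lapply(pkgs, citation))
```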
[[1]]
Scheidegger A (2024). _adaptMCMC: Implementation of a Generic
Adaptive Monte Carlo Markov Chain Sampler_. R package version 1.5,
<https://CRAN.R-project.org/package=adaptMCMC>.
[[2]]
R Core Team (2023). _R: A Language and Environment for Statistical
Computing_. R Foundation for Statistical Computing, Vienna, Austria.
<https://www.R-project.org/>.
[[3]]
Plummer M, Best N, Cowles K, Vines K (2006). "CODA: Convergence
Diagnosis and Output Analysis for MCMC." _R News_, *6*(1), 7-11.
<https://journal.r-project.org/archive/>.
[[5]]
Wickham H, François R, Henry L, Müller K, Vaughan D (2023). _dplyr: A
Grammar of Data Manipulation_. R package version 1.1.4,
<https://CRAN.R-project.org/package=dplyr>.
[[6]]
Wickham H (2023). _forcats: Tools for Working with Categorical
Variables (Factors)_. R package version 1.0.0,
<https://CRAN.R-project.org/package=forcats>.
[[7]]
Wickham H (2016). _ggplot2: Elegant Graphics for Data Analysis_.
Springer-Verlag New York. ISBN 978-3-319-24277-4,
<https://ggplot2.tidyverse.org>.
[[8]]
Friedman J, Tibshirani R, Hastie T (2010). "Regularization Paths for
Generalized Linear Models via Coordinate Descent." _Journal of
Statistical Software_, *33*(1), 1-22. doi:10.18637/jss.v033.i01
<https://doi.org/10.18637/jss.v033.i01>.
Simon N, Friedman J, Tibshirani R, Hastie T (2011). "Regularization
Paths for Cox's Proportional Hazards Model via Coordinate Descent."
_Journal of Statistical Software_, *39*(5), 1-13.
doi:10.18637/jss.v039.i05 <https://doi.org/10.18637/jss.v039.i05>.
Tay JK, Narasimhan B, Hastie T (2023). "Elastic Net Regularization
Paths for All Generalized Linear Models." _Journal of Statistical
Software_, *106*(1), 1-31. doi:10.18637/jss.v106.i01
<https://doi.org/10.18637/jss.v106.i01>.
[[11]]
Tibshirani J, Athey S, Sverdrup E, Wager S (2024). _grf: Generalized
Random Forests_. R package version 2.3.2,
<https://CRAN.R-project.org/package=grf>.
[[12]]
Bourgon R (2023). _intervals: Tools for Working with Points and
Intervals_. R package version 0.15.4,
<https://CRAN.R-project.org/package=intervals>.
[[13]]
Zhu H (2024). _kableExtra: Construct Complex Table with 'kable' and
Pipe Syntax_. R package version 1.4.0,
<https://CRAN.R-project.org/package=kableExtra>.
[[14]]
Xie Y (2023). _knitr: A General-Purpose Package for Dynamic Report
Generation in R_. R package version 1.45, <https://yihui.org/knitr/>.
Xie Y (2015). _Dynamic Documents with R and knitr_, 2nd edition.
Chapman and Hall/CRC, Boca Raton, Florida. ISBN 978-1498716963,
<https://yihui.org/knitr/>.
Xie Y (2014). "knitr: A Comprehensive Tool for Reproducible Research in
R." In Stodden V, Leisch F, Peng RD (eds.), _Implementing Reproducible
Computational Research_. Chapman and Hall/CRC. ISBN 978-1466561595.
[[15]]
Grolemund G, Wickham H (2011). "Dates and Times Made Easy with
lubridate." _Journal of Statistical Software_, *40*(3), 1-25.
<https://www.jstatsoft.org/v40/i03/>.
[[16]]
Bache S, Wickham H (2022). _magrittr: A Forward-Pipe Operator for R_. R
package version 2.0.3, <https://CRAN.R-project.org/package=magrittr>.
[[17]]
Venables WN, Ripley BD (2002). _Modern Applied Statistics with S_,
Fourth edition. Springer, New York. ISBN 0-387-95457-0,
<https://www.stats.ox.ac.uk/pub/MASS4/>.
[[18]]
Bates D, Maechler M, Jagan M (2023). _Matrix: Sparse and Dense Matrix
Classes and Methods_. R package version 1.6-1.1,
<https://CRAN.R-project.org/package=Matrix>.
[[21]]
Wickham H, Henry L (2023). _purrr: Functional Programming Tools_. R
package version 1.0.2, <https://CRAN.R-project.org/package=purrr>.
[[22]]
Wickham H, Hester J, Bryan J (2024). _readr: Read Rectangular Text
Data_. R package version 2.1.5,
<https://CRAN.R-project.org/package=readr>.
[[23]]
Barnier J (2022). _rmdformats: HTML Output Formats and Templates for
'rmarkdown' Documents_. R package version 1.0.4,
<https://CRAN.R-project.org/package=rmdformats>.
[[24]]
Tibshirani R, Tibshirani R, Taylor J, Loftus J, Reid S, Markovic J
(2019). _selectiveInference: Tools for Post-Selection Inference_. R
package version 1.2.5,
<https://CRAN.R-project.org/package=selectiveInference>.
[[26]]
Wickham H (2023). _stringr: Simple, Consistent Wrappers for Common
String Operations_. R package version 1.5.1,
<https://CRAN.R-project.org/package=stringr>.
[[27]]
Therneau T (2023). _A Package for Survival Analysis in R_. R package
version 3.5-7, <https://CRAN.R-project.org/package=survival>.
Therneau TM, Grambsch PM (2000). _Modeling Survival Data: Extending the
Cox Model_. Springer, New York. ISBN 0-387-98784-3.
[[28]]
Müller K, Wickham H (2023). _tibble: Simple Data Frames_. R package
version 3.2.1, <https://CRAN.R-project.org/package=tibble>.
[[29]]
Wickham H, Vaughan D, Girlich M (2024). _tidyr: Tidy Messy Data_. R
package version 1.3.1, <https://CRAN.R-project.org/package=tidyr>.
[[30]]
Wickham H, Averick M, Bryan J, Chang W, McGowan LD, François R,
Grolemund G, Hayes A, Henry L, Hester J, Kuhn M, Pedersen TL, Miller E,
Bache SM, Müller K, Ooms J, Robinson D, Seidel DP, Spinu V, Takahashi
K, Vaughan D, Wilke C, Woo K, Yutani H (2019). "Welcome to the
tidyverse." _Journal of Open Source Software_, *4*(43), 1686.
doi:10.21105/joss.01686 <https://doi.org/10.21105/joss.01686>.